
Page 1:

Grappa: A latency tolerant runtime for large-scale irregular applications

Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin
University of Washington
January 21, 2014

Page 2:

A tale of two programmers

(Photo: blupics@flickr)

Page 3:

Pat’s problem: traverse an unbalanced tree

• Tree is embedded in a graph

• ~1T edges in graph, ~1B edges in tree

• Power law, low diameter

Page 4:

How about a big shared-memory machine?


Page 5:

How about special purpose hardware?


Page 6:

How about a commodity cluster?


Page 7:

Grappa

• Goal: Provide irregular application programmers what they want

• Global view programming model

• Good small-message performance

• Tasks, threads, latency tolerance

• Fine-grained synchronization

• Load balancing


Page 8:

Where is Grappa in the stack?


Application

Compiler

Library

Runtime

Hardware

Page 9:

Grappa’s system view

[Diagram: global data partitioned across each node's DRAM; cores on all nodes connected by the network; a single global pool of tasks.]

Each word of memory has a designated home core. All accesses to that word run on that core (see the sketch below).
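A minimal sketch of the home-core idea: a block-cyclic mapping from a global byte offset to the core that owns it. The block size and layout here are illustrative assumptions, not Grappa's exact scheme.

#include <cstddef>
#include <cstdint>

constexpr size_t kBlockSize = 64;   // assumed distribution granularity

struct GlobalOffset {
  uint64_t offset;                  // byte offset into the global heap
};

// Every access to the word at 'a' is shipped to home_core(a, num_cores)
// and executed there; this is what the delegate operations shown later do.
inline int home_core(GlobalOffset a, int num_cores) {
  return int((a.offset / kBlockSize) % num_cores);
}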

Page 10:

Main idea: tolerate latency with other work

[Diagram: Task 1 issues a read() to DRAM and must wait.]

Page 11:

Main idea: tolerate latency with other work

[Diagram: while Task 1's read() is outstanding, the core has other work available: Task 2 through Task ~1000.]

Page 12:

Main idea: tolerate latency with other work

[Diagram: Tasks 1 through ~1000 each issue their own read() to DRAM, so the core always has work while reads are in flight.]

Page 13:


Outline

• Motivation

• Programming Grappa

• Key components

• Performance

• Other projects

Page 14:

Pat’s problem: traverse an unbalanced tree

• Tree is embedded in a graph

• ~1T edges in graph, ~1B edges in tree

• Power law, low diameter

Page 15:

A single node, serial starting point

struct Vertex {
  index_t id;
  Vertex * first_child;
  size_t num_children;
};

Page 16:

A single node, serial starting point

struct Vertex {
  index_t id;
  Vertex * first_child;
  size_t num_children;
};

void search(Vertex * vertex_addr) {
  Vertex v = *vertex_addr;
  Vertex * children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

Page 17:

A single node, serial starting point

struct Vertex {
  index_t id;
  Vertex * first_child;
  size_t num_children;
};

void search(Vertex * vertex_addr) {
  Vertex v = *vertex_addr;
  Vertex * children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

int main( int argc, char * argv[] ) {
  Vertex * root = create_tree();
  search(root);
  return 0;
}

Page 18:

The standard boilerplate (not quite right)

struct Vertex {
  index_t id;
  Vertex * first_child;
  size_t num_children;
};

void search(Vertex * vertex_addr) {
  Vertex v = *vertex_addr;
  Vertex * children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

int main( int argc, char * argv[] ) {
  Grappa::init( &argc, &argv );
  Vertex * root = create_tree();
  search(root);
  Grappa::finalize();
  return 0;
}

Page 19:

Back to serial in Grappa’s global view

struct Vertex {
  index_t id;
  Vertex * first_child;
  size_t num_children;
};

void search(Vertex * vertex_addr) {
  Vertex v = *vertex_addr;
  Vertex * children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

int main( int argc, char * argv[] ) {
  Grappa::init( &argc, &argv );
  Grappa::run( []{
    Vertex * root = create_tree();
    search(root);
  });
  Grappa::finalize();
  return 0;
}

Page 20:

Back to serial in Grappa’s global view

struct Vertex {
  index_t id;
  Vertex * first_child;
  size_t num_children;
};

void search(Vertex * vertex_addr) {
  Vertex v = *vertex_addr;
  Vertex * children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

int main( int argc, char * argv[] ) {
  init( &argc, &argv );
  run( []{
    Vertex * root = create_tree();
    search(root);
  });
  finalize();
  return 0;
}

Page 21:

Addressing global memory

struct Vertex {
  index_t id;
  GlobalAddress<Vertex> first_child;
  size_t num_children;
};

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = *vertex_addr;
  GlobalAddress<Vertex> children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

int main( int argc, char * argv[] ) {
  init( &argc, &argv );
  run( []{
    GlobalAddress<Vertex> root = create_tree();
    search(root);
  });
  finalize();
  return 0;
}

Page 22:

Accessing global memory

struct Vertex {
  index_t id;
  GlobalAddress<Vertex> first_child;
  size_t num_children;
};

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  GlobalAddress<Vertex> children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

int main( int argc, char * argv[] ) {
  init( &argc, &argv );
  run( []{
    GlobalAddress<Vertex> root = create_tree();
    search(root);
  });
  finalize();
  return 0;
}

Page 23:

Global memory with delegates

struct Vertex {
  index_t id;
  GlobalAddress<Vertex> first_child;
  size_t num_children;
};

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::call(vertex_addr, [=]{ return *vertex_addr; });
  GlobalAddress<Vertex> children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

int main( int argc, char * argv[] ) {
  init( &argc, &argv );
  run( []{
    GlobalAddress<Vertex> root = create_tree();
    search(root);
  });
  finalize();
  return 0;
}

Page 24:

Global memory with delegates

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  GlobalAddress<Vertex> children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    search(children+i);
  }
}

int main( int argc, char * argv[] ) {
  init( &argc, &argv );
  run( []{
    GlobalAddress<Vertex> root = create_tree();
    search(root);
  });
  finalize();
  return 0;
}

Page 25:

Exposing some parallelism

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  GlobalAddress<Vertex> children = v.first_child;
  for( int i = 0; i < v.num_children; ++i ) {
    spawn( [=]{ search(children+i); });
  }
}

int main( int argc, char * argv[] ) {
  init( &argc, &argv );
  run( []{
    finish( []{
      GlobalAddress<Vertex> root = create_tree();
      search(root);
    });
  });
  finalize();
  return 0;
}

Page 26:

Exposing more parallelism

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  GlobalAddress<Vertex> children = v.first_child;
  forall<unbound,async>( 0, v.num_children, [children](int64_t i) {
    search(children+i);
  });
}

int main( int argc, char * argv[] ) {
  init( &argc, &argv );
  run( []{
    finish( []{
      GlobalAddress<Vertex> root = create_tree();
      search(root);
    });
  });
  finalize();
  return 0;
}

Page 27:

Delegation is more than just RDMA

struct Vertex {
  index_t id;
  GlobalAddress<Vertex> first_child;
  size_t num_children;
  color_t color;
};

GlobalAddress<int> color_counts = global_alloc<int>(NUM_COLORS);

void search(GlobalAddress<Vertex> vertex_addr) {
  Vertex v = delegate::read(vertex_addr);
  color_t c = v.color;
  bool done = delegate::call( color_counts + c, [c](int & count) {
    if( count == MAX ) {
      return true;
    } else {
      count++;
      return false;
    }
  });

  if( done ) return;

  GlobalAddress<Vertex> children = v.first_child;
  forall<unbound,async>( 0, v.num_children, [children](int64_t i) {
    search(children+i);
  });
}

Page 28:


Outline

• Motivation

• Programming Grappa

• Key components

• Performance

• Other projects

Page 29:

Grappa design

[Diagram: each cluster node (node 0 … node n) runs the application on top of Grappa; within Grappa, the tasking and DSM layers sit on an active messages/aggregator layer, which uses GASNet over the InfiniBand network interface. Each node has multiple cores, and the nodes' memories together form the global address space.]

Page 30:

User-level context switching

[Diagram: each core keeps a task queue, a ready queue, and suspended workers; each worker (e.g. Worker 1 running Task 1, Worker 2 running Task 2) holds a small status block and a stack in the L1 cache.]

1 cacheline of status, 3 cachelines of stack.

Main innovation: we keep worker state small and prefetch it to cover DRAM latency (sketched below).
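A minimal sketch of the prefetch-before-switch idea (not Grappa's actual scheduler): because each worker's hot state is only a few cache lines, the scheduler can prefetch the next worker's state while the current one finishes, hiding the DRAM miss.

#include <cstddef>
#include <cstdint>
#include <deque>

struct Worker {
  // Assumed layout: one cache line of status plus a few hot stack lines.
  alignas(64) uint8_t status[64];
  alignas(64) uint8_t hot_stack[3 * 64];
};

static void prefetch_worker(const Worker* w) {
  // Pull the status line and top-of-stack lines toward the L1 cache.
  __builtin_prefetch(w->status, 1 /*write*/, 3 /*high temporal locality*/);
  for (size_t i = 0; i < sizeof(w->hot_stack); i += 64)
    __builtin_prefetch(w->hot_stack + i, 1, 3);
}

// 'context_switch_to' stands in for the real assembly-level switch.
void schedule_loop(std::deque<Worker*>& ready, void (*context_switch_to)(Worker*)) {
  while (!ready.empty()) {
    Worker* next = ready.front();
    ready.pop_front();
    if (!ready.empty())
      prefetch_worker(ready.front());   // overlap the next switch's cache miss
    context_switch_to(next);            // run 'next' until it yields or blocks
  }
}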

Page 31:

Accessing memory through delegates

[Diagram: tasks on several cores each issue a var+1 update destined for the word var, which lives in one core's DRAM.]

Each word of memory has a designated home core. All accesses to that word run on that core. The requestor blocks until the operation is complete.

Page 32:

Accessing memory through delegates

[Diagram: the var+1 updates are delivered to var's home core and applied there, one after another.]

Since var is private to its home core, the updates can simply be applied (see the sketch below).
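A sketch of a blocking delegated increment built from the primitives on the previous slides (GlobalAddress and delegate::call); the umbrella header name is an assumption. Because the lambda runs on the word's home core, where the word is private, a plain read-modify-write needs no locks or atomics.

#include <Grappa.hpp>   // assumed umbrella header
using namespace Grappa;

// Add 'amount' to the word named by 'addr' and return the new value.
// The caller blocks until the reply comes back, as described above.
int64_t delegate_increment(GlobalAddress<int64_t> addr, int64_t amount) {
  return delegate::call(addr.core(), [addr, amount] {
    int64_t* p = addr.pointer();   // local pointer, valid only on the home core
    *p += amount;                  // serialized with every other delegate here
    return *p;
  });
}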

Page 33:

Mitigating low injection rate with aggregation

[Diagram: on node 0, workers 1–3 each produce a small message (Msg 1–3) destined for node n.]

1. Queue messages in a linked list.
2. Serialize messages into a single buffer, using prefetching.
3. Send the buffer over the network as one wire message.
4. Deserialize and execute the messages at the destination.

A sketch of this per-destination aggregation appears below.

[Chart: maximum bandwidth (bytes/second) versus message size (1 B to 1 MB) for raw MPI, Grappa over raw GASNet, and Grappa with aggregation.]
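A minimal sketch of per-destination message aggregation (not Grappa's actual aggregator): small messages bound for the same node are appended to one buffer and flushed as a single wire message once a size threshold is reached; the threshold and the transport stub are assumptions.

#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for the real transport (e.g. a GASNet active message).
static void send_wire_message(int dest_node, const uint8_t* data, size_t len) {
  std::printf("to node %d: %zu bytes\n", dest_node, len);
  (void)data;
}

struct OutgoingQueue {
  int dest_node = 0;
  std::vector<uint8_t> buffer;            // serialized small messages
  size_t flush_threshold = 64 * 1024;     // assumed aggregation target
};

void enqueue_small_message(OutgoingQueue& q, const void* payload, uint16_t len) {
  // 1./2. Queue and serialize: append a tiny length header plus the payload.
  const uint8_t* hdr = reinterpret_cast<const uint8_t*>(&len);
  q.buffer.insert(q.buffer.end(), hdr, hdr + sizeof(len));
  const uint8_t* body = static_cast<const uint8_t*>(payload);
  q.buffer.insert(q.buffer.end(), body, body + len);
  // 3. Send: once enough bytes accumulate, ship one large wire message
  //    instead of many tiny ones.
  if (q.buffer.size() >= q.flush_threshold) {
    send_wire_message(q.dest_node, q.buffer.data(), q.buffer.size());
    q.buffer.clear();
  }
  // 4. The receiver walks the buffer, deserializing and executing each message.
}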

Page 34:


Outline

• Motivation

• Programming Grappa

• Key components

• Performance

• Other projects

Page 35:

A snapshot of current performance

• Current implementation: 15K lines. Usable, but many optimizations are still outstanding.

• Three questions: Do Grappa's components work individually? Do they work together? How do we compare with other systems?

• Ran on an AMD Interlagos cluster (32 2.1 GHz cores, 64 GB, 40 Gb InfiniBand); compared with a 128-processor Cray XMT1 (500 MHz, 128 streams each).

Page 36:

Measuring context switch performance

• Simple: N tasks yield in a loop on a single core. We vary N to see how context switch time changes (a sketch of the loop follows below).

• Context switching is isolated here: the system is doing nothing else
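A sketch of the yield-loop microbenchmark, assuming a Grappa-style API (spawn, finish, and yield as in the earlier slides; exact signatures are assumptions).

#include <Grappa.hpp>   // assumed umbrella header
#include <chrono>
#include <cstdint>
#include <cstdio>

void context_switch_bench(int64_t num_workers, int64_t yields_per_worker) {
  auto start = std::chrono::steady_clock::now();
  Grappa::finish([=] {
    for (int64_t w = 0; w < num_workers; ++w) {
      Grappa::spawn([=] {
        for (int64_t i = 0; i < yields_per_worker; ++i)
          Grappa::yield();            // force a context switch on this core
      });
    }
  });
  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::steady_clock::now() - start).count();
  // Each yield is one switch; amortize the total time over all switches.
  std::printf("avg switch: %.1f ns\n",
              double(ns) / double(num_workers * yields_per_worker));
}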


Page 37:

…of system design on data-intensive workloads, particularly large-scale graph analysis problems, that are important among cybersecurity, informatics, and network-understanding workloads. The BFS benchmark builds a search tree containing parent nodes for each traversed vertex during the search. While this is a relatively simple problem to solve, it exercises the random-access and fine-grained synchronization capabilities of a system as well as being a primitive in many other graph algorithms. Performance is measured in traversed edges per second (TEPS), where the number of edges is the edges making up the generated BFS tree. With some modifications to the XMT reference version of Graph500 BFS, the XMT compiler can be made to recognize and apply a Manhattan loop collapse, exposing enough parallelism to allow it to scale out to 64 nodes for the problem scales we show. In order to make comparison easier, we do not employ algorithmic improvements for any of these versions, though there are many [11, 57]; this makes our results difficult to compare with published Graph500 results. Grappa can be expected to benefit the same as MPI due to decreased communication.

IntSort: This sorting benchmark is taken from the NAS Parallel Benchmark Suite [9, 44] and is one on which the Cray XMT's early predecessor once held the world speed record [2]. The largest problem size, class D, ranks two billion uniformly distributed random integers using either a bucket or a counting sort algorithm, depending on the strengths of the system. Bucket sort executes a greater number of loops, but is able to leverage locality and avoid communication completely in the final phase, ranking within buckets. For these reasons, the MPI reference version and our Grappa implementation use bucket sort. On the other hand, the Cray XMT cannot take advantage of locality, but has an efficient compiler-supported parallel prefix sum, so it performs best using the counting sort algorithm. The performance metrics for NAS Parallel Benchmarks, including IntSort, are “millions of operations per second” (MOPS). For IntSort, this “operation” is ranking a single key, so it is roughly comparable to “GUPS” or “TEPS.”

PageRank: This is a common centrality metric for graphs. PageRank is an iterative algorithm with a common pattern of gather, apply, and scatter on the rank of each vertex. The algorithm is often implemented by sparse linear algebra libraries, with the main kernel being the sparse matrix dense vector multiply. For the multiply step, Grappa parallelizes over the rows and parallelizes each dot product. PageRank has the fortunate property that the accumulation function over the in-edges is associative and commutative, so they can be processed in any order or in parallel. Rather than the programmer writing the parallel dot product as local accumulations with a final all-reduce step, we simply send streaming increments to each element of the final vector (a sketch of this appears below). We compare PageRank to published results for the Trilinos linear algebra library implemented in MPI [48], and multithreaded PageRank for the XMT [10]. For Grappa, we run on a scale 29 graph using the Graph500 generator.

The metric we use is algorithmic time, which means startup and loading of the data structure (from disk) is not included in the measurement. Grappa collects statistics about application behavior (packets sent, context switches, etc.) and these are discussed where appropriate.
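The streaming-increment multiply above can be sketched with the primitives from the slides (GlobalAddress, forall, delegate::call); the vertex layout and the delegated-add helper are assumptions, and a real version would issue the adds asynchronously so the aggregator can batch them.

#include <Grappa.hpp>   // assumed umbrella header
#include <cstdint>
using namespace Grappa;

struct PRVertex {
  GlobalAddress<int64_t> out_edges;   // ids of destination vertices
  size_t num_out_edges;
  double rank;
};

// Hypothetical helper: apply an add on the home core of the target element.
void delegate_add(GlobalAddress<double> target, double amount) {
  delegate::call(target.core(), [target, amount] {
    *(target.pointer()) += amount;
  });
}

void spmv_step(GlobalAddress<PRVertex> verts, size_t nv,
               GlobalAddress<double> next_rank) {
  forall(verts, nv, [=](PRVertex& v) {
    if (v.num_out_edges == 0) return;
    double contrib = v.rank / v.num_out_edges;
    // Stream one increment per out-edge straight to the destination's slot
    // in the result vector; the adds commute, so no final all-reduce is needed.
    forall<unbound,async>(0, (int64_t)v.num_out_edges, [=](int64_t i) {
      int64_t dst = delegate::read(v.out_edges + i);
      delegate_add(next_rank + dst, contrib);
    });
  });
}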

7. Evaluation

The goal of our evaluation is to understand whether the core pieces of the Grappa runtime system, namely our tasking system and the global memory/communication layer, work as expected and whether together they are able to efficiently run irregular applications. We evaluate Grappa in three basic steps:

• We present results that show that Grappa can support large amounts of concurrency, sufficient for remote memory access and aggregation. The communication layer is able to sustain a very high rate of global memory operations. We also show the performance of a graph kernel that stresses communication and concurrency together.

• We characterize system behavior, including profiling where execution time goes, and how aggregation affects message size and rates.

• Finally, we show how some more realistic irregular workloads on Grappa compare to the Cray XMT and hand-tuned MPI code.

7.1. Basic Grappa Performance

User-level context switching: Fast context switching is at the heart of Grappa's latency tolerance abilities. We assess context switch overheads using a simple microbenchmark that runs a configurable number of workers on a single core, where each worker increments values in a large array.

[Figure 5: Average context switch time (ns) with and without prefetching, as the number of workers grows.]

Figure 5 shows the average context switch time as the number of workers grows. At our standard operating point (≈1K …

Context switching is fast

At the 1K-thread operating point: ~50 ns.

500K threads: 75 ns! This is switching at the bandwidth limit to DRAM.

Pthreads: 450–800 ns.

Page 38:

Measuring random access bandwidth

• Giga updates per second (GUPs) benchmark measures cluster-wide random access bandwidth

• Only one task per core sends messages, so the aggregator is essentially isolated (see the sketch below)
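A sketch of a GUPS-style delegated update loop under the model above; on_all_cores and mycore follow common Grappa naming but are assumptions here, and a blocking delegate is shown for clarity where the benchmark's delegated mode keeps many updates in flight.

#include <Grappa.hpp>   // assumed umbrella header
#include <cstdint>
#include <random>
using namespace Grappa;

void gups(GlobalAddress<int64_t> table, size_t table_size,
          size_t updates_per_core) {
  on_all_cores([=] {
    std::mt19937_64 rng(12345 + mycore());              // per-core random stream
    std::uniform_int_distribution<size_t> pick(0, table_size - 1);
    for (size_t i = 0; i < updates_per_core; ++i) {
      GlobalAddress<int64_t> target = table + pick(rng);
      // Ship the increment to the element's home core and apply it there.
      delegate::call(target.core(), [target] { *(target.pointer()) += 1; });
    }
  });
}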


Page 39:

Random access BW is good

[Chart: GUPS versus number of nodes (8–64), comparing blocking and delegated update types.]

Theoretical peak at 64 nodes is 6.4 GUPs

Minimal context switching

Page 40:

Unbalanced tree search in memory

• The original UTS benchmark was designed to exercise work stealing; we modified it to include memory access as well

• Creates unbalanced tree in memory and times traversal

• Visiting a vertex requires remote access, so we are context switching, aggregating delegate messages, and work stealing all at the same time

• Metric is vertex throughput

Page 41:

Grappa is able to exploit parallelism in UTS

[Chart: UTS traversal throughput (MVerts/s) versus number of nodes (8–64) on the T1XL tree, for Grappa and the Cray XMT1.]

Page 42:

Grappa tolerates latency in UTS


Summary: Overall, Grappa provides a general programming model at a moderate performance cost. While Grappa's core functionality performs well, applications can specialize similar functionality for their problems and obtain better performance than Grappa on the same hardware. This is, however, not the end of the story for Grappa's performance; Grappa is a young library and, as discussed in the next section, is limited by its implementation rather than the hardware on which it runs. We believe there are opportunities for optimization that will improve its performance.

7.3. Characterization

                                  GUPS    BFS  IntSort    UTS  Pagerank
App. message rate/core (K/s)       984    983      732    411       474
Avg. app. message bytes           31.8   33.9     30.6   47.3      45.5
Network BW per node (MB/s)         478    511      343    298       332
Avg. network message bytes       23.2K   4.3K    12.8K   4.2K      3.0K
Avg. active tasks/core             0.9   58.2      0.9  326.2     429.3
Max. active tasks/core               1    128        1    507      1024
Avg. ready queue length/core       2.3    6.1      1.3  186.1     300.4
Avg. ctx switch rate/core (K/s)   34.2    539      127    543       336
Steal attempts/core/s                0      0        0   54.4         0
Steal successes/core/s               0      0        0   17.0         0

Table 2: Internal runtime metrics, for 64-node, 16-core-per-node benchmark runs.

Runtime metrics: Table 2 shows a number of internal runtime metrics collected while executing the benchmarks from the previous section. These are per-core averages computed over all 1024 cores of the 64-node, 16-core-per-node jobs.

The first group of four metrics relates to the communication layer. Application messages refer to those issued by the user code; network messages refer to the aggregated packets sent over the wire. The data show that most application messages are only a handful of bytes, but our aggregator is able to turn them into packets of many kilobytes.

The next group of five metrics relates to the scheduler. We show the average and maximum number of concurrently executing tasks, along with the average length of the ready queue and the average context switch rate. The last two lines show the rate of work-stealing from other cores. Only UTS depends on work-stealing for performance; the other applications exploit locality by binding tasks to specific cores. Nevertheless, even in UTS, steals are an infrequent occurrence and account for a small fraction of the execution time.

How much concurrency does Grappa require? How much latency does it add? Grappa depends on concurrency to cover the latency of aggregation and remote communication. How much is required for good performance?

Figure 8 shows a 48-node, 16-core-per-node run of UTS, varying the number of concurrently executing tasks on each core. The top pane shows the overall throughput of the tree search. The middle pane shows average blocking delegate operation latency in microseconds. The bottom pane shows idle time; that is, the fraction of the time the scheduler could not find a ready task to run.

[Figure 8: Throughput (MVerts/s), request latency, and idleness as the number of concurrent workers per core is varied from 256 to 1024.]

We can observe three things from this figure. First, above 512 concurrently executing tasks per core, idle time is practically zero: these tasks generate requests fast enough to cover the latency of aggregation and communication. This matches the results seen in the throughput plot; throughput peaks at 512 workers and gradually decreases after that due to the overhead of unnecessary context switches. Finally, we see that with 512 workers, the average per-request latency is 1.8 ms.

Does Grappa scale? Figures 9, 6, and 7 show scaling out to 64 nodes. Grappa scales best on BFS and worst on IntSort, but is in general able to make use of more nodes. Unfortunately, memory limitations in our current network library keep us from exploring scaling beyond 64 nodes; this will be addressed in future work.

In the limit, aggregation does not scale: the time it takes to aggregate enough random requests to build a buffer of reasonable size scales with the size of the cluster. However, we believe our current approach will work for clusters with hundreds of nodes. In the future, we will explore hierarchical, collective techniques to aggregate requests from multiple nodes as well; we believe this can apply to clusters with thousands of nodes.

What limits Grappa's performance? The most common operation in Grappa is sending a message. Three key operations occur in a message's lifetime: creating and enqueuing a message, serializing a message in the aggregator, and deserializing a message at the destination. We benchmarked each of these steps individually to shed light on the mechanism.

Minimum-sized messages can be created and enqueued at a rate of 16M/s, serialized into an aggregation buffer at …

With 512 active tasks per core, throughput peaks and idle time is practically zero. (48 nodes, T1XL tree)

Page 43:

Comparing application performance

• Additional benchmarks:
  - Breadth-first search: simple version of the Graph500 benchmark
  - Integer sort: NAS Parallel Benchmark bucket/counting sort
  - PageRank: Google's web-graph centrality metric

• For these three, MPI and XMT currently beat Grappa, but

• Price/performance is still better than XMT

• Grappa code is shorter and simpler than MPI

• There is a cost to Grappa's generality (but we're working on reducing that cost!)

Page 44:

Comparing GUPS performance

          Grappa    XMT    MPI
GUPS           1   2.23   0.11

• XMT version: basic GUPS with hardware fetch-and-add

• MPI version: HPCC RandomAccess

• Contains specialized implementation of aggregation, but

• not optimized for out-of-cache

• limited support for concurrent communication

Performance normalized to Grappa, 64 nodes

Page 45:

Comparing UTS-in-memory performance

          Grappa    XMT    MPI
GUPS           1   2.23   0.11
UTS (T1)       1   0.38

• XMT implementation uses fine-grained synchronization at each vertex, which is unnecessary

• Difficult to avoid; baked into compiler, OS, hardware, etc.

• Grappa supports this too, but also allows coarse-grained synchronization

• Grappa also takes advantage of locality in spawns and edge lists

Performance normalized to Grappa, 64 nodes

Page 46:

Comparing BFS performance

          Grappa    XMT    MPI
GUPS           1   2.23   0.11
UTS (T1)       1   0.38
BFS            1   1.63   3.52

• BFS_Simple from Graph500 (no Beamer-like optimizations)

• MPI version includes specialized implementation of aggregation, moving only two 8-byte vertex IDs

• Grappa runs additional code, requires more space to support general aggregation

• BFS messages include an additional 16 bytes of deserialization/synchronization information

Performance normalized to Grappa, 64 nodes

Page 47:

Comparing IntSort performance

          Grappa    XMT    MPI
GUPS           1   2.23   0.11
UTS (T1)       1   0.38
BFS            1   1.63   3.52
IntSort        1   3.59   5.36

• NAS Parallel Benchmarks, class D

• Grappa writes keys directly to destination with delegates

• MPI version implements specialized aggregation with local sort plus collective communication with Alltoallv

• Can we implement aggregation with collectives in Grappa?

Performance normalized to Grappa, 64 nodes

Page 48:

Comparing Pagerank performance

          Grappa    XMT    MPI
GUPS           1   2.23   0.11
UTS (T1)       1   0.38
BFS            1   1.63   3.52
IntSort        1   3.59   5.36
Pagerank       1   4.35   4.87

• Grappa version is a straightforward nested loop

• MPI version uses optimized Trilinos sparse matrix library

Performance normalized to Grappa, 64 nodes

Page 49:

Where are we?

• Fundamentals are strong:
  - Context switching is fast
  - Aggregation is fast
  - UTS composes them and gets good performance

• There is currently a cost to our generality

• MPI is mostly beating us for now, often replicating what we’re doing in a specialized way, as well as using tricks we may be able to take advantage of

• We are working on our next-generation networking layer now

Page 50:


Outline

• Motivation

• Programming Grappa

• Key components

• Performance

• Other projects

Page 51:

Brandon Holt – Quals – 7 Nov 2013

Compiler support: automatic delegates
For where your data is, there your code will be also.

Best performance comes from executing a task where the data is
– delegate operations execute some small region of code atomically on the core that owns the memory
– the generic delegate::call() executes the enclosed region of the task (expressed in a lambda) remotely

Delegated regions can be inferred automatically
– inspect uses of Grappa global pointers, find regions that use global pointers on a particular core, and extract them into a delegate
– implementing this with a custom LLVM pass

Additional ideas
– pass a continuation to allow delegates to hop between multiple cores before returning to the original caller
– specialize multiple versions of delegate regions and select the best one dynamically based on runtime values

int main(int argc, char* argv[]) {
  init(&argc, &argv);
  run([]{
    long global* A = global_alloc<long>(Asize);
    long global* B = global_alloc<long>(Bsize);

    // Manual delegate
    forall(B, Bsize, [=](long& b){
      delegate::call((A+b).core(), [A,b]{
        long * Ab = (A+b).pointer();
        (*Ab) %= b;
      });
    });

    // Automatic delegate
    forall(B, Bsize, [=](long& b){
      A[b] %= b;
    });
  });
  finalize();
}

Page 52:

Brandon Holt – Quals – 7 Nov 2013

“Schrödinger” Consistency
Until observed, operations can be committed and not.

Delay synchronization as long as possible
– commit when an operation would be able to observe the order
– example: pushes are kept local; pops search for an available push (a toy sketch follows below)

Abstract data structure semantics – express how operations affect and observe abstract state

– abstract locks allow commutative ops to proceed in parallel, and block conflicting ops so they run later

– inverse operations annihilate locally, needing no synchronization

– Similar in spirit to “Transactional Boosting”

Maurice Herlihy & Eric Koskinen. Transactional Boosting: A Methodology for Highly-Concurrent Transactional Objects. PPoPP 2008.
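A toy sketch of the "pushes kept local, pops search for an available push" idea (not Grappa's implementation): matching push/pop pairs annihilate locally, so only the surplus would ever need global synchronization.

#include <optional>
#include <vector>

template <typename T>
struct LocalCombiningStack {
  std::vector<T> pending_pushes;    // uncommitted, invisible to other cores

  void push(const T& x) { pending_pushes.push_back(x); }

  // A pop first tries to annihilate against a pending local push; only when
  // none is available would it fall back to a global commit path, which is
  // where ordering could actually be observed.
  std::optional<T> pop() {
    if (!pending_pushes.empty()) {
      T x = pending_pushes.back();
      pending_pushes.pop_back();
      return x;                     // matched locally: no messages, no locks
    }
    return std::nullopt;            // caller would delegate to the global stack
  }
};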


Page 53:

Compiling queries for parallel shared memory platforms

Q1(yr) :- R(journal, "type", "Journal"),
          R(journal, "title", "Nature"),
          R(journal, "issued", yr)

→ Grappa parallel shared memory code

Page 54:

Conclusion

• Extreme latency tolerance helps us scale irregular applications

• Grappa's runtime system provides
  - a task library
  - a distributed shared memory system
  - a network aggregator

• Context switches are fast, even with many threads
• Random access bandwidth is good
• Aggregation is effective


Page 55:


Questions?
