stream and ra hpc challenge benchmarks - chapel

44

Upload: others

Post on 03-Feb-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STREAM and RA HPC Challenge Benchmarks - Chapel
Page 2: STREAM and RA HPC Challenge Benchmarks - Chapel

STREAM and RA HPC Challenge Benchmarks simple, regular 1D computations

results from SC ’09 competition

AMR Computations hierarchical, regular computation

SSCA #2 unstructured graph computation

Chapel: Sample Codes 2

Page 3: STREAM and RA HPC Challenge Benchmarks - Chapel

Two classes of competition: Class 1: “best performance”

Class 2: “most productive” Judged on: 50% performance 50% elegance

Four recommended benchmarks: STREAM, RA, FFT, HPL

Use of library routines: discouraged

Why you may care: provides an alternative to the top-500’s focus on peak performance

Recent Class 2 Winners:2008: performance: IBM (UPC/X10)

productive: Cray (Chapel), IBM (UPC/X10), Mathworks (Matlab)

2009: performance: IBM (UPC+X10)

elegance: Cray (Chapel)

Chapel: Sample Codes

Page 4: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel: Sample Codes 4

64 77 107 165 176329787

1130

8800

Global STREAM Triad

EP STREAM Triad Random Access FFT HPL

Code Size Summary (Source Lines of Code)

Chapel

Reference

Page 5: STREAM and RA HPC Challenge Benchmarks - Chapel

Benchmark 2008 2009 Improvement

Global STREAM1.73 TB/s(512 nodes)

10.8 TB/s(2048 nodes)

6.2x

EP STREAM1.59 TB/s(256 nodes)

12.2 TB/s(2048 nodes)

7.7x

Global RA0.00112 GUPs(64 nodes)

0.122 GUPs(2048 nodes)

109x

Global FFTsingle-threadedsingle-node

multi-threaded multi-node

multi-node parallel

Global HPLsingle-threaded single-node

multi-threaded single-node

single-node parallel

All timings on ORNL Cray XT4:

• 4 cores/node

• 8 GB/node

• no use of library routines

Chapel: Sample Codes

Page 6: STREAM and RA HPC Challenge Benchmarks - Chapel

const ProblemSpace: domain(1, int(64))

dmapped Block([1..m])

= [1..m];

var A, B, C: [ProblemSpace] real;

forall (a,b,c) in (A,B,C) do

a = b + alpha * c;

Chapel: Sample Codes 6

This loop should eventually be written:A = B + alpha * C;

(and can be today, but performance is worse)

Page 7: STREAM and RA HPC Challenge Benchmarks - Chapel

coforall loc in Locales do

on loc {

local {

var A, B, C: [1..m] real;

forall (a,b,c) in (A,B,C) do

a = b + alpha * c;

}

}

Chapel: Sample Codes 7

Page 8: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel: Sample Codes 8

0

2000

4000

6000

8000

10000

12000

14000

1 2048

GB

/s

Number of Locales

Performance of HPCC STREAM Triad (Cray XT4)

2008 Chapel Global TPL=1

2008 Chapel Global TPL=2

2008 Chapel Global TPL=3

2008 Chapel Global TPL=4

MPI EP PPN=1

MPI EP PPN=2

MPI EP PPN=3

MPI EP PPN=4

Chapel Global TPL=1

Chapel Global TPL=2

Chapel Global TPL=3

Chapel Global TPL=4

Chapel EP TPL=4

Page 9: STREAM and RA HPC Challenge Benchmarks - Chapel

const TableDist = new dmap(new Block([0..m-1])),

UpdateDist = new dmap(new Block([0..N_U-1]));

const TableSpace: domain … dmapped TableDist = …,

Updates: domain … dmapped UpdateDist = …;

var T: [TableSpace] uint(64);

forall ( ,r) in (Updates,RAStream()) do

on TableDist.idxToLocale(r & indexMask) {

const myR = r;

local T(myR & indexMask) ^= myR;

}

Chapel: Sample Codes 9

This body should eventually simply be written:on T(r&indexMask) do

T(r&indexMask) ^= r;

(and again, can be today, but performance is worse)

Page 10: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel: Sample Codes 10

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

1 2048

GU

P/s

Number of Locales

Performance of HPCC Random Access (Cray XT4)

Chapel TPL=1

Chapel TPL=2

Chapel TPL=4

Chapel TPL=8

Page 11: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel: Sample Codes 11

0%

1%

2%

3%

4%

5%

6%

7%

32 64 128 256 512 1024 2048

% E

ffic

ien

cy(o

f sc

ale

d C

hap

el T

PL=

4 lo

cal G

UP/

s)

Number of Locales

Efficiency of HPCC Random Access on 32+ Locales (Cray XT4)

Chapel TPL=1

Chapel TPL=2

Chapel TPL=4

Chapel TPL=8

MPI PPN=4

MPI No Buckets PPN=4

MPI+OpenMP TPN=4

Page 12: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel: Sample Codes 12

0

0.004

0.008

0.012

0.016

0.02

1 4

GU

P/s

Number of Locales

Performance of HPCC Random Access (Cray CX1)

0

5

10

15

20

25

30

35

1 4

GB

/s

Number of Locales

Performance of HPCC STREAM Triad (Cray CX1)

0

50

100

150

200

250

300

350

1 64

GB

/s

Number of Locales

Performance of HPCC STREAM Triad (IBM pSeries 575)

0

2

4

6

8

10

12

1 64

GB

/s

Tasks per Locale

Performance of HPCC STREAM Triad (SGI Altix)

Page 13: STREAM and RA HPC Challenge Benchmarks - Chapel

STREAM and RA HPC Challenge Benchmarks simple, regular 1D computations

results from SC ’09 competition

AMR Computations hierarchical, regular computation

SSCA #2 unstructured graph computation

Chapel: Sample Codes 13

Page 14: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (14)

Adaptive Mesh Refinement in Chapel

What’s so great about domains?

Ability to reason about unions of rectangular index spaces

(as unions of domains)

Trivial shared-memory

parallelism; easy access

to distributed parallelism with

distributions

Fewer nested loops, and no

bounds to mess up

Striding allows much better

description of grids (vertices,

edges, cell centers)

Dimension-independent code

coarse

gridfiner grids

finest grids

Page 15: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (15)

Strided grid description

Page 16: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (16)

Strided grid description

var cell_centers = [1..7 by 2, 1..5 by 2];

Page 17: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (17)

Strided grid description

var cell_centers = [1..7 by 2, 1..5 by 2];

var vertical_edges = [0..8 by 2, 1..5 by 2];

Page 18: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (18)

Strided grid description

var cell_centers = [1..7 by 2, 1..5 by 2];

var vertical_edges = [0..8 by 2, 1..5 by 2];

var horizontal_edges = [1..7 by 2, 0..6 by 2];

Page 19: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (19)

Strided grid description

var cell_centers = [1..7 by 2, 1..5 by 2];

var vertical_edges = [0..8 by 2, 1..5 by 2];

var horizontal_edges = [1..7 by 2, 0..6 by 2];

var vertices = [0..8 by 2, 0..6 by 2];

Page 20: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (20)

Dimension-free stencils

Tasks:

1. Create an N-dimensional grid.

2. Evaluate the function

on the grid.

3. Approximate the Laplacian,

Page 21: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (21)

1. Create an N-dimensional grid

config param N: int = 2;

const dimensions = [1..N];

Page 22: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (22)

1. Create an N-dimensional gridnum_points

dx

config const num_points = 20;

const dx = 1.0 / (num_points-1);

Page 23: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (23)

1. Create an N-dimensional grid

var grid_points: domain(N);

var ranges: N*range;

for d in dimensions do

ranges(d) = 1..num_points;

grid_points = ranges;

Page 24: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (24)

2. Evaluate the function

var f: [grid_points] real = 1.0;

forall point in points {

for d in dimensions do

f(point) *= sin( (point(d)-1)*dx );

}

Page 25: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (25)

2. Evaluate the function

var f: [grid_points] real = 1.0;

forall point in points {

for d in dimensions do

f(point) *= sin( (point(d)-1)*dx );

}

Calculates real

coordinate xd

Page 26: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (26)

3. Approximate the Laplacian,

var interior_points = grid_points.expand(-1);

var laplacian: [interior_points] real;

Page 27: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (27)

3. Approximate the Laplacian,

var interior_points = grid_points.expand(-1);

var laplacian: [interior_points] real;

Laplacian is only defined

on the interior of the grid

Page 28: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (28)

3. Approximate the Laplacian,

forall point in interior_points {

var shift: N*int;

for d in dimensions {

shift(d) = 1;

laplacian(point) += ( f(point+shift)

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

}

}

Page 29: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (29)

3. Approximate the Laplacian,

forall point in interior_points {

var shift: N*int;

for d in dimensions {

shift(d) = 1;

laplacian(point) += ( f(point+shift)

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

}

}Approximates

Page 30: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (30)

3. Approximate the Laplacian,

forall point in interior_points {

var shift: N*int;

for d in dimensions {

shift(d) = 1;

laplacian(point) += ( f(point+shift)

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

}

}

Translates in dimension d

Page 31: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (31)

3. Approximate the Laplacian,

forall point in interior_points {

var shift: N*int;

for d in dimensions {

shift(d) = 1;

laplacian(point) += ( f(point+shift)

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

}

}

Translates in dimension d

point

Page 32: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (32)

3. Approximate the Laplacian,

forall point in interior_points {

var shift: N*int;

for d in dimensions {

shift(d) = 1;

laplacian(point) += ( f(point+shift)

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

}

}

Translates in dimension d

point

+shift

Page 33: STREAM and RA HPC Challenge Benchmarks - Chapel

Chapel (33)

3. Approximate the Laplacian,

forall point in interior_points {

var shift: N*int;

for d in dimensions {

shift(d) = 1;

laplacian(point) += ( f(point+shift)

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

}

}

Translates in dimension d

point

+shift

-shift

Page 34: STREAM and RA HPC Challenge Benchmarks - Chapel

STREAM and RA HPC Challenge Benchmarks simple, regular 1D computations

results from SC ’09 competition

AMR Computations hierarchical, regular computation

SSCA #2 unstructured graph computation

Chapel: Sample Codes 34

Page 35: STREAM and RA HPC Challenge Benchmarks - Chapel

Given a set of heavy edges HeavyEdges in directed graph G, find

sub-graphs of outgoing paths with length ≤ maxPathLength

Chapel: Sample Codes 35

97

97

23

6577

85

91

33

76

44

54

57

12

82

61

Page 36: STREAM and RA HPC Challenge Benchmarks - Chapel

Given a set of heavy edges HeavyEdges in directed graph G, find

sub-graphs of outgoing paths with length ≤ maxPathLength

maxPathLength = 0

Chapel: Sample Codes 36

Page 37: STREAM and RA HPC Challenge Benchmarks - Chapel

Given a set of heavy edges HeavyEdges in directed graph G, find

sub-graphs of outgoing paths with length ≤ maxPathLength

maxPathLength = 0 maxPathLength = 1

Chapel: Sample Codes 37

Page 38: STREAM and RA HPC Challenge Benchmarks - Chapel

Given a set of heavy edges HeavyEdges in directed graph G, find

sub-graphs of outgoing paths with length ≤ maxPathLength

maxPathLength = 0 maxPathLength = 1 maxPathLength = 2

Chapel: Sample Codes 38

Page 39: STREAM and RA HPC Challenge Benchmarks - Chapel

def rootedHeavySubgraphs(

G,

type vertexSet;

HeavyEdges : domain,

HeavyEdgeSubG : [],

in maxPathLength: int ) {

forall (e, subgraph) in

(HeavyEdges, HeavyEdgeSubG) {

const (x,y) = e;

var ActiveLevel: vertexSet;

ActiveLevel += y;

subgraph.edges += e;

subgraph.nodes += x;

subgraph.nodes += y;

for pathLength in 1..maxPathLength {

var NextLevel: vertexSet;

forall v in ActiveLevel do

forall w in G.Neighbors(v) do

atomic {

if !subgraph.nodes.member(w) {

NextLevel += w;

subgraph.nodes += w;

subgraph.edges += (v, w);

}

}

if (pathLength < maxPathLength) then

ActiveLevel = NextLevel;

}

}

}

Chapel: Sample Codes 39

Page 40: STREAM and RA HPC Challenge Benchmarks - Chapel

def rootedHeavySubgraphs(

G,

type vertexSet;

HeavyEdges : domain,

HeavyEdgeSubG : [],

in maxPathLength: int ) {

forall (e, subgraph) in

(HeavyEdges, HeavyEdgeSubG) {

const (x,y) = e;

var ActiveLevel: vertexSet;

ActiveLevel += y;

subgraph.edges += e;

subgraph.nodes += x;

subgraph.nodes += y;

for pathLength in 1..maxPathLength {

var NextLevel: vertexSet;

forall v in ActiveLevel do

forall w in G.Neighbors(v) do

atomic {

if !subgraph.nodes.member(w) {

NextLevel += w;

subgraph.nodes += w;

subgraph.edges += (v, w);

}

}

if (pathLength < maxPathLength) then

ActiveLevel = NextLevel;

}

}

}

Chapel: Sample Codes 40

Generic Implementation of Graph G

G.Vertices: A domain whose indices represent the vertices• For toroidal graphs, a domain(d), so vertices are d-tuples• For other graphs, a domain(1), so vertices are integers

G.Neighbors: An array over G.Vertices• For toroidal graphs, a fixed-size array over the domain [1..2*d]• For other graphs…

…an associative domain with indices of type index(G.vertices)…a sparse subdomain of G.Vertices

This kernel and the others are generic w.r.t. these decisions!

Page 41: STREAM and RA HPC Challenge Benchmarks - Chapel

def rootedHeavySubgraphs(

G,

type vertexSet;

HeavyEdges : domain,

HeavyEdgeSubG : [],

in maxPathLength: int ) {

forall (e, subgraph) in

(HeavyEdges, HeavyEdgeSubG) {

const (x,y) = e;

var ActiveLevel: vertexSet;

ActiveLevel += y;

subgraph.edges += e;

subgraph.nodes += x;

subgraph.nodes += y;

for pathLength in 1..maxPathLength {

var NextLevel: vertexSet;

forall v in ActiveLevel do

forall w in G.Neighbors(v) do

atomic {

if !subgraph.nodes.member(w) {

NextLevel += w;

subgraph.nodes += w;

subgraph.edges += (v, w);

}

}

if (pathLength < maxPathLength) then

ActiveLevel = NextLevel;

}

}

}

Chapel: Sample Codes 41

Generic with respect to vertex sets

vertexSet: A type argument specifying how to represent vertex subsets

Requirements:• Parallel iteration• Ability to add members, test for membership

Options:• An associative domain over verticesdomain(index(G.vertices))

• A sparse subdomain of the verticessparse subdomain(G.vertices)

Page 42: STREAM and RA HPC Challenge Benchmarks - Chapel

def rootedHeavySubgraphs(

G,

type vertexSet;

HeavyEdges : domain,

HeavyEdgeSubG : [],

in maxPathLength: int ) {

forall (e, subgraph) in

(HeavyEdges, HeavyEdgeSubG) {

const (x,y) = e;

var ActiveLevel: vertexSet;

ActiveLevel += y;

subgraph.edges += e;

subgraph.nodes += x;

subgraph.nodes += y;

for pathLength in 1..maxPathLength {

var NextLevel: vertexSet;

forall v in ActiveLevel do

forall w in G.Neighbors(v) do

atomic {

if !subgraph.nodes.member(w) {

NextLevel += w;

subgraph.nodes += w;

subgraph.edges += (v, w);

}

}

if (pathLength < maxPathLength) then

ActiveLevel = NextLevel;

}

}

}

Chapel: Sample Codes 42

The same genericity applies to subgraphs

Page 43: STREAM and RA HPC Challenge Benchmarks - Chapel

def rootedHeavySubgraphs(

G,

type vertexSet;

HeavyEdges : domain,

HeavyEdgeSubG : [],

in maxPathLength: int ) {

forall (e, subgraph) in

(HeavyEdges, HeavyEdgeSubG) {

const (x,y) = e;

var ActiveLevel: vertexSet;

ActiveLevel += y;

subgraph.edges += e;

subgraph.nodes += x;

subgraph.nodes += y;

for pathLength in 1..maxPathLength {

var NextLevel: vertexSet;

forall v in ActiveLevel do

forall w in G.Neighbors(v) do

atomic {

if !subgraph.nodes.member(w) {

NextLevel += w;

subgraph.nodes += w;

subgraph.edges += (v, w);

}

}

if (pathLength < maxPathLength) then

ActiveLevel = NextLevel;

}

}

}

Chapel: Sample Codes 43

Page 44: STREAM and RA HPC Challenge Benchmarks - Chapel

STREAM and RA HPC Challenge Benchmarks simple, regular 1D computations

results from SC ’09 competition

AMR Computations hierarchical, regular computation

SSCA #2 unstructured graph computation

Chapel: Sample Codes 44