Survey on High Productivity Computing Systems (HPCS) Languages
Saliya Ekanayake, School of Informatics and Computing, Indiana University


TRANSCRIPT

Page 1: Survey on HPCS Languages

Survey on High Productivity Computing Systems (HPCS) Languages

Saliya Ekanayake
School of Informatics and Computing, Indiana University
3/11/2013, part of qualifier presentation

Page 2: Survey on HPCS Languages

Outline

Parallel Programs

Parallel Programming Memory Models

Idioms of Parallel Computing
◦ Data Parallel Computation
◦ Data Distribution
◦ Asynchronous Remote Tasks
◦ Nested Parallelism
◦ Remote Transactions

Page 3: Survey on HPCS Languages

Parallel Programs

Steps in Creating a Parallel Program

[Figure: a sequential computation is decomposed into tasks; the tasks are assigned to abstract computing units (ACUs, e.g. processes); the ACUs are orchestrated into the parallel program; and the program is mapped onto physical computing units (PCUs, e.g. processors or cores). Steps: Decomposition → Assignment → Orchestration → Mapping.]

Constructs to Create ACUs
◦ Explicit: Java threads, Parallel.ForEach in TPL
◦ Implicit: for loops, also do blocks in Fortress
◦ Compiler directives: #pragma omp parallel for in OpenMP
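For comparison, a minimal sketch in Chapel (one of the three surveyed languages) of explicit versus implicit task creation; the loop bound and messages are illustrative:

  // Explicit: 'begin' spawns a new task to run the statement.
  begin writeln("hello from an explicitly created task");

  // Implicit: 'forall' leaves it to the compiler and runtime to
  // decide how many tasks execute the iterations in parallel.
  forall i in 1..10 do
    writeln("iteration ", i);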

Page 4: Survey on HPCS Languages

Parallel Programming Memory Models

[Figure: four parallel programming memory models (Shared, Distributed, Partitioned Global Address Space, and Hybrid), each shown with a shared-memory and a distributed-memory implementation. In the shared model all tasks access one shared global address space; in the distributed model each task has only its own local address space and communicates over the network; in the partitioned model each task keeps a local address space while all tasks also share a partitioned global address space; the hybrid model combines shared address spaces within task groups with distribution across groups.]

[Figure: a partitioned shared address space over three tasks. Each task's local address space holds a private variable X; Task 1 also holds a private Y; Task 3 holds a shared variable Z; an array is declared as shared across the partitioned shared address space.]

Partitioned Shared Address Space

◦ Each task has declared a private variable X
◦ Task 1 has declared another private variable Y
◦ Task 3 has declared a shared variable Z
◦ An array is declared as shared across the shared address space
◦ Every task can access variable Z
◦ Every task can access each element of the array
◦ Only Task 1 can access variable Y
◦ Each copy of X is local to its declaring task and does not necessarily hold the same value
◦ Accessing the array elements local to a task is faster than accessing other elements
◦ Task 3 may access Z faster than Task 1 and Task 2

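A minimal Chapel sketch of the partitioned behavior described above, assuming the standard BlockDist module; each element of a block-distributed array is owned by one locale, and accesses to locally owned elements are faster than remote accesses:

  use BlockDist;

  // A 1-D array whose blocks are spread across the locales
  // (Chapel's units of locality, roughly one per node).
  const D = {1..8} dmapped Block(boundingBox={1..8});
  var A: [D] int;

  // Every task may read and write every element, but each iteration
  // runs on the locale owning its index; 'here' names that locale.
  forall i in D do
    A[i] = here.id;

  writeln(A);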

Page 5: Survey on HPCS Languages

Idioms of Parallel Computing

Common Task                 | Chapel                            | X10                  | Fortress
Data parallel computation   | forall                            | finish … for … async | for
Data distribution           | dmapped                           | DistArray            | arrays, vectors, matrices
Asynchronous remote tasks   | on … begin                        | at … async           | spawn … at
Nested parallelism          | cobegin … forall                  | for … async          | for … spawn
Remote transactions         | on … atomic (not implemented yet) | at … atomic          | at … atomic

Page 6: Survey on HPCS Languages

Data Parallel Computation

Chapel

Zipper (statement context):
  forall (a,b,c) in zip(A,B,C) do
    a = b + alpha * c;

Arithmetic domain (statement context):
  forall i in 1..N do
    a(i) = b(i);

Short forms:
  [i in 1..N] a(i) = b(i);
  A = B + alpha * C;

Expression context:
  writeln(+ reduce [i in 1..10] i**2);

X10

Sequential, over array points and a number range (statement context):
  for (p in A)
    A(p) = 2 * A(p);

  for ([i] in 1..N)
    sum += i;

Parallel (statement context):
  finish for (p in A)
    async A(p) = 2 * A(p);

Fortress

Parallel, number range:
  for i <- 1:10 do
    A[i] := i
  end

Parallel, array indices:
  A:ZZ32[3,3] = [1 2 3; 4 5 6; 7 8 9]
  for (i,j) <- A.indices() do
    A[i,j] := i
  end

Parallel, array elements:
  for a <- A do
    println(a)
  end

Parallel, set:
  for a <- {[\ZZ32\] 1,3,5,7,9} do
    println(a)
  end

Sequential, number range:
  for i <- sequential(1:10) do
    A[i] := i
  end

Sequential, set:
  for a <- sequential({[\ZZ32\] 1,3,10,8,6}) do
    println(a)
  end
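A runnable Chapel sketch combining the Chapel forms above; the array names, size, and alpha are illustrative:

  config const n = 10;
  const alpha = 0.5;
  var A, B, C: [1..n] real;

  B = 2.0;                      // whole-array assignment (short form)
  C = 3.0;

  // Zipper iteration over three arrays in lockstep.
  forall (a, b, c) in zip(A, B, C) do
    a = b + alpha * c;

  writeln(A);
  // Reduction used in expression context: the sum of squares 1..n.
  writeln(+ reduce [i in 1..n] i**2);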

Page 7: Survey on HPCS Languages

Data Distribution

Chapel

Domain and array:
  var D: domain(2) = [1..m, 1..n];
  var A: [D] real;

Block distribution of domain:
  const D = [1..n, 1..n];
  const BD = D dmapped Block(boundingBox=D);
  var BA: [BD] real;

X10

Region and array:
  val R = (0..5) * (1..3);
  val arr = new Array[Int](R, 10);

Block distribution of array:
  val blk = Dist.makeBlock((1..9)*(1..9));
  val data : DistArray[Int] = DistArray.make[Int](blk, ([i,j]:Point(2)) => i*j);

Fortress

Intended distributions (no working implementation):
◦ blocked
◦ blockCyclic
◦ columnMajor
◦ rowMajor
◦ Default
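A runnable sketch of the Chapel column, assuming the standard BlockDist module and the curly-brace domain literals of later Chapel releases; n is illustrative:

  use BlockDist;

  config const n = 4;
  const D = {1..n, 1..n};                     // 2-D domain
  const BD = D dmapped Block(boundingBox=D);  // block-distributed domain
  var BA: [BD] real;

  // Each iteration runs on the locale that owns index (i, j).
  forall (i, j) in BD do
    BA[i, j] = here.id;

  writeln(BA);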

Page 8: Survey on HPCS Languages

Asynchronous Remote Tasks

Chapel

Asynchronous:
  begin writeln("Hello");
  writeln("Hi");

Remote and asynchronous:
  on A[i] do begin
    A[i] = 2 * A[i];
  writeln("Hello");
  writeln("Hi");

X10

Asynchronous:
  { // activity T
    async {S1;} // spawns T1
    async {S2;} // spawns T2
  }

Remote and asynchronous:
• at (p) async S: migrates the computation to p, spawns a new activity in p to evaluate S, and returns control
• async at (p) S: spawns a new activity in the current place and returns control, while the spawned activity migrates the computation to p and evaluates S there
• async at (p) async S: spawns a new activity in the current place and returns control, while the spawned activity migrates the computation to p and spawns another activity in p to evaluate S there

Fortress

Remote and asynchronous:
  spawn at a.region(i) do exp end

Implicit multiple threads and region shift:
  (v, w) := (exp1,
             at a.region(i) do exp2 end)

Implicit thread group and region shift:
  do
    v := exp1
    at a.region(i) do
      w := exp2
    end
    x := v + w
  end
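A runnable sketch of Chapel's on … begin combination; the array and index are illustrative, and the sync block is added so the program demonstrably waits for the spawned task:

  var A: [1..4] int = 1;

  sync {                                 // waits for tasks begun inside
    on A[2] do begin A[2] = 2 * A[2];    // async task on A[2]'s locale
    writeln("Hi");                       // prints without waiting for it
  }
  writeln(A);                            // A[2] is now 2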

Page 9: Survey on HPCS Languages

Nested Parallelism

Chapel

Data parallelism inside task parallelism:
  cobegin {
    forall (a,b,c) in zip(A,B,C) do
      a = b + alpha * c;
    forall (d,e,f) in zip(D,E,F) do
      d = e + beta * f;
  }

Task parallelism inside data parallelism:
  sync forall a in A do
    if (a % 5 == 0) then
      begin f(a);
    else
      a = g(a);

X10

Data parallelism inside task parallelism:
  finish { async S1; async S2; }

Note on task parallelism inside data parallelism: given data parallel code in X10, it is possible to spawn new activities inside the body that get evaluated in parallel. However, in the absence of a built-in data parallel construct, a scenario that requires such nesting may be custom implemented with constructs like finish, for, and async, instead of first writing data parallel code and then embedding task parallelism in it.

Fortress

Explicit thread:
  T:Thread[\Any\] = spawn do exp end
  T.wait()

Structural construct (data parallelism inside task parallelism):
  do exp1 also do exp2 end

Task parallelism inside data parallelism:
  arr:Array[\ZZ32,ZZ32\] = array[\ZZ32\](4).fill(id)
  for i <- arr.indices() do
    t = spawn do arr[i] := factorial(i) end
    t.wait()
  end
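A runnable Chapel sketch of data parallelism nested inside task parallelism; the arrays and coefficients are illustrative:

  config const n = 8;
  const alpha = 0.5, beta = 2.0;
  var A, B, C, D, E, F: [1..n] real;

  B = 1.0; C = 2.0; E = 3.0; F = 4.0;

  // Two tasks run concurrently; each body is a data-parallel loop.
  cobegin {
    forall (a, b, c) in zip(A, B, C) do a = b + alpha * c;
    forall (d, e, f) in zip(D, E, F) do d = e + beta * f;
  }
  writeln(A);
  writeln(D);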

Page 10: Survey on HPCS Languages

Remote Transactions

X10

Conditional local:
  def pop() : T {
    var ret : T;
    when (size > 0) {
      ret = list.removeAt(0);
      size--;
    }
    return ret;
  }

Unconditional local:
  var n : Int = 0;
  finish {
    async atomic n = n + 1; // (a)
    async atomic n = n + 2; // (b)
  }

  var n : Int = 0;
  finish {
    async n = n + 1;        // (a) -- BAD
    async atomic n = n + 2; // (b)
  }

Unconditional remote:
  val blk = Dist.makeBlock((1..1)*(1..1), 0);
  val data = DistArray.make[Int](blk, ([i,j]:Point(2)) => 0);
  val pt : Point = [1,1];
  finish for (pl in Place.places()) {
    async {
      val dataloc = blk(pt);
      if (dataloc != pl) {
        Console.OUT.println("Point " + pt + " is in place " + dataloc);
        at (dataloc) atomic {
          data(pt) = data(pt) + 1;
        }
      } else {
        Console.OUT.println("Point " + pt + " is in place " + pl);
        atomic data(pt) = data(pt) + 2;
      }
    }
  }
  Console.OUT.println("Final value of point " + pt + " is " + data(pt));

The atomicity is weak in the sense that an atomic block appears atomic only to other atomic blocks running at the same place. Atomic code running at remote places, or non-atomic code running at local or remote places, may interfere with local atomic code if care is not taken.

Fortress

Local:
  do
    x:ZZ32 := 0
    y:ZZ32 := 0
    z:ZZ32 := 0
    atomic do
      x += 1
      y += 1
    also atomic do
      z := x + y
    end
    z
  end

Remote (true if distributions were implemented):
  f(y:ZZ32):ZZ32 = y y
  D:Array[\ZZ32,ZZ32\] = array[\ZZ32\](4).fill(f)
  q:ZZ32 = 0
  at D.region(2) atomic do
    println("at D.region(2)")
    q := D[2]
    println("q in first atomic: " q)
  also at D.region(1) atomic do
    println("at D.region(1)")
    q += 1
    println("q in second atomic: " q)
  end
  println("Final q: " q)
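Chapel's on … atomic form was not implemented at the time of this survey; as a point of comparison only, a minimal sketch of local atomicity using Chapel's atomic variables, which are a different mechanism from atomic blocks:

  var n: atomic int;

  // Two tasks update the shared counter; fetchAdd makes each
  // read-modify-write indivisible, like 'async atomic' in X10.
  sync {
    begin n.fetchAdd(1);
    begin n.fetchAdd(2);
  }
  writeln(n.read());   // always prints 3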

Page 11: Survey on HPCS Languages

K-Means Implementation

Why K-Means?
◦ Simple to comprehend
◦ Broad enough to exploit most of the idioms

Distributed parallel implementations
◦ Chapel and X10

Parallel non-distributed implementation
◦ Fortress

Complete working code in the appendix of the paper; an illustrative sketch of the core step follows below.
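The complete programs are in the paper's appendix; the fragment below is only an illustrative Chapel sketch of the data-parallel nearest-center assignment step, not the author's code, with sizes and array names chosen for the example:

  config const n = 100, k = 4, dim = 2;
  var points: [1..n, 1..dim] real;    // input points (filled elsewhere)
  var centers: [1..k, 1..dim] real;   // current cluster centers
  var assignment: [1..n] int;

  // Data-parallel step: each point picks its nearest center.
  forall p in 1..n {
    var best = 1, bestDist = max(real);
    for c in 1..k {
      var d = 0.0;
      for j in 1..dim do
        d += (points[p, j] - centers[c, j]) ** 2;
      if d < bestDist { bestDist = d; best = c; }
    }
    assignment[p] = best;
  }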


Page 12: Survey on HPCS Languages

Thank you!