TRANSCRIPT
Hierarchical Locality and Parallel Programming in the Extreme Scale Era
Tarek El-Ghazawi, The George Washington University
University of Southern California, September 29, 2016
Overview
- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks
Top Ten Challenges for Exascale: areas where research and advances are needed!
1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design & Decision
9. Resilience and Correctness
10. Scientific Productivity
(DoE ASCAC Subcommittee Report, Feb 2014. Items highlighted in the slide are data-movement and/or programming related.)
Technological Challenges (1): Combined Bandwidth and Energy Challenges for Exascale
- Locality and data movement matter a great deal: cost (energy and time) rises rapidly with distance
- Locality and data movement are critical even at short distances, and more so at far distances
[Figures: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 14]
Technological Challenges (2): Bandwidth
- The interconnect is not keeping up with the growth in compute capability. Many apps require 1 Byte/FLOP off-chip, which is not possible at 10 TFLOPs per chip and beyond.
- Intel Knights Landing: 500 GB/s against roughly 3 TFLOPS of peak compute => about 1/6 Byte/FLOP.
- Huge bandwidth density (GB/s/μm) is needed on-chip due to the large number of cores in a small area.
(Ref: Miller, D. A., Proceedings of the IEEE, 2009.)
Manycore bandwidth requirements keep growing, widening the gap between available I/O and compute capability.
[Figure: Bytes/FLOP versus year (2012-2015) for Xeon Phi (Knights Corner, Knights Landing) and NVIDIA K20/K40/K80, illustrating the declining Bytes/FLOP ratio.]
Architectural Challenges: Architectures Are Becoming Deeply Hierarchical at Extreme Scale - Chips and Systems
[Figures: step-by-step build-up of the hierarchy from cores and dies up through chips, nodes, blades, chassis, and cabinets; examples include the Cray XC40 system and the Tilera TILE64 chip.]
Where Do Programming Models Fit in All That?

What is a programming model?
- An abstract virtual machine
- A view of data and execution
- The logical interface between architecture and applications

Why programming models?
- They decouple applications and architectures: write applications that run effectively across architectures, and design new architectures that can effectively support legacy applications

Programming model design considerations:
- Expose modern architectural features, to exploit machine power and improve performance
- Maintain ease of use
- Together, the two previous points increase productivity!
Current Programming Models and Locality Awareness
- Message passing (e.g., MPI): each process has its own address space; locality-aware; two-sided communication.
- Shared memory (e.g., OpenMP): one shared address space; not locality-aware; one-sided communication.
- Partitioned Global Address Space (PGAS): a global address space, logically partitioned across threads; locality-aware; one-sided communication; examples include UPC and Chapel.
PGAS languages include UPC, Chapel, and X10.
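As a minimal illustration (my sketch, not from the talk) of how a PGAS language such as UPC exposes locality to the programmer, the fragment below declares a blocked shared array (the same arrayA used later in the talk) and iterates with an affinity expression so that each thread touches only the elements it owns:

```c
/* Minimal UPC sketch (illustrative, not from the talk). */
#include <upc.h>
#include <stdio.h>

shared [4] int arrayA[32];   /* blocks of 4 elements, dealt round-robin over threads */

int main(void) {
    upc_forall (int i = 0; i < 32; i++; &arrayA[i]) {
        /* The body runs on the thread with affinity to arrayA[i],
         * so every access here is a local shared access. */
        arrayA[i] = MYTHREAD;
    }
    upc_barrier;
    if (MYTHREAD == 0)
        printf("arrayA[10] has affinity to thread %d\n",
               (int)upc_threadof(&arrayA[10]));
    return 0;
}
```

The affinity field in upc_forall is exactly the kind of locality test the talk later proposes to support in hardware.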
Memory Accesses in UPC: Shared Address Translation Overheads
- Measured the shared address space overheads with a set of micro-benchmarks, each measuring a different component separately: network time, address translation, address incrementation, and memory access.
[Figure: time per access (ns) by type of access, with the components stacked; values span 1.42 ns for private accesses, 4.2-4.53 ns for local shared accesses, and 1736.8 ns for remote shared accesses, with effective rates of 5.25 GB/s, 734 MB/s, and 4.25 MB/s respectively. A companion panel shows the percentage of time spent in actual memory access for each access type.]
[Figure: the PGAS memory model, with a shared space partitioned across threads 0 to THREADS-1 above per-thread private spaces.]
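To make the translation cost concrete, here is a hedged sketch (my illustration; the struct and function names are hypothetical) of the arithmetic a UPC runtime typically performs to map a linear shared-array index onto a (thread, phase, offset) triple before the access can be issued:

```c
#include <stddef.h>

/* Hypothetical shared-pointer components for 'shared [B] T A[N]'
 * over T threads; field names are illustrative. */
typedef struct {
    size_t thread; /* which thread owns the element       */
    size_t phase;  /* position within the current block   */
    size_t offset; /* byte offset into that thread's part */
} shared_ptr_t;

/* Translate linear index i for block size B over T threads. */
static shared_ptr_t translate(size_t i, size_t B, size_t T, size_t elem_size) {
    shared_ptr_t p;
    size_t block = i / B;            /* global block number          */
    p.thread = block % T;            /* blocks are dealt round-robin */
    p.phase  = i % B;                /* index inside the block       */
    size_t local_block = block / T;  /* blocks already local to owner */
    p.offset = (local_block * B + p.phase) * elem_size;
    return p;
}
```

Every shared access pays for these divisions and modulos; the micro-benchmarks above separate this translation cost from incrementation, network, and raw memory-access time.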
Memory Access Costs in Chapel
Tested shared address access costs in Chapel, using Chapel syntax to test:
- The local part of a distributed object, un-optimized (accessing local data without saying "local")
- Local, optimized (the local part hand-optimized by saying "local")
- Local and non-distributed
Findings:
- Compiler optimization -> 2x faster; compiler plus hand optimization -> 70x faster
- Compiler optimization affects remote accesses as well
- Both UPC and Chapel require "unproductive!" hand tuning to improve local shared accesses
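In UPC, that hand tuning usually means casting a pointer-to-shared to an ordinary C pointer for data whose affinity is known to be local, so later accesses skip shared-address translation entirely. A minimal sketch (my example, assuming the blocked array from earlier, where thread MYTHREAD's first block starts at index MYTHREAD * 4):

```c
#include <upc.h>

shared [4] int arrayA[32];

void scale_local_block(int factor) {
    /* Hand tuning: cast the first block with affinity to this thread
     * to a private pointer. Valid in UPC only because these elements
     * are local to MYTHREAD. Fast, but unproductive and error-prone. */
    int *localA = (int *)&arrayA[MYTHREAD * 4];
    for (int k = 0; k < 4; k++)
        localA[k] *= factor;
}
```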
Fast Address Translation for PGAS
Software solutions:
- Hand tweaking - non-productive
- Compiler optimizations - reduced arithmetic for some straightforward cases
- Look-up tables, full and reduced - they take memory! [ICPP05]
- TLBs, ...
Hardware solutions:
- Create hardware that understands how to traverse the PGAS memory model and supports the basic costly needs
- Avail it through instructions and leverage them in the compiler
- Some work exists for UPC, little for Chapel

Hardware Support for PGAS: Example Operations to Support in Hardware
- Shared address incrementation
- Load/store to/from a PGAS shared address
- Address translation support: convert a shared address to the system virtual address used to perform the access
- Locality tests for remote data: can be used to tell whether to call the network subroutines, e.g., by testing the affinity field in a work-sharing construct
- Availed as ISA extensions: new instructions used directly by the compiler
- Current hardware support and instructions only cover address mapping; future support for remote data accesses and various types of synchronization is of interest
Hardware/Software Co-Design Platform in a Nutshell
- First prototype is in FPGAs and supports small core counts and apps: Leon3 cores on a Virtex-6 FPGA, extended with the proposed PGAS hardware support for shared addressing; BUPC and GASNet ported on top of Leon3, with benchmarking kernels.
- Second is primarily software and supports bigger core counts and codes: BUPC, GASNet, and the benchmarking kernels ported on top of Gem5; the new instructions are inserted into code generation, with a runtime system that recognizes and enforces the developed mapping, so UPC code runs out of the box. Runs on a workstation today, with a cluster planned for the future.
PGAS Hardware Support Overview

shared [4] int arrayA[32];
arrayA[10] = 5;

[Figure: arrayA distributed in blocks of four across threads 0-3: thread 0 holds elements 0-3 and 16-19, thread 1 holds 4-7 and 20-23, thread 2 holds 8-11 and 24-27, thread 3 holds 12-15 and 28-31.]
- Address incrementation (pgas_inc_{x}): advances a shared pointer, e.g., from (Th=0, Ph=0, Va=0x3f10) to (Th=2, Ph=2, Va=0x3f18) for arrayA[10].
- Address translation and store (pgas_st_{x}): converts the shared pointer representation into a regular pointer representation (e.g., 0xfff01203f14) and performs the store.
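For contrast, here is a hedged sketch of the software pointer-increment logic that a pgas_inc instruction replaces (my reconstruction of the standard UPC blocked-layout arithmetic; names are illustrative). It reproduces the Th/Ph/Va transitions above for B = 4 and T = 4:

```c
#include <stddef.h>
#include <stdint.h>

/* Shared-pointer components as in the slide: thread, phase, and the
 * virtual address within the owning thread's partition. */
typedef struct { unsigned th, ph; uintptr_t va; } spointer_t;

/* Advance a shared pointer by one element of size elem_size for a
 * 'shared [B]' layout over T threads (software version of pgas_inc). */
static void pgas_inc_sw(spointer_t *p, unsigned B, unsigned T,
                        size_t elem_size) {
    p->ph++;
    p->va += elem_size;
    if (p->ph == B) {                 /* crossed a block boundary */
        p->ph = 0;
        p->th++;                      /* next block lives on the next thread */
        p->va -= B * elem_size;       /* same local block on that thread */
        if (p->th == T) {             /* wrapped around all threads */
            p->th = 0;
            p->va += B * elem_size;   /* move on to the next local block */
        }
    }
}
```

Starting from (Th=0, Ph=0, Va=0x3f10), ten applications of this routine yield (Th=2, Ph=2, Va=0x3f18), matching the slide. This branchy arithmetic on every pointer bump is what the ISA extension removes.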
Early Results: NPB Kernels with Hardware Support (Gem5, Alpha 21264)
Possible Solutions for Hierarchical Locality Exploitation
- Rewrite your code with low-level tricks to target the underlying hierarchical architecture? Great performance, but not productive and non-portable.
- Extend programming models with hierarchical syntax and semantics and ask programmers to worry about all of those hardware details (make them hierarchical-locality-aware)? Portable, but not productive.
Productive Division of Responsibilities: The Programmer and the System
Programmer:
- Use a locality-aware programming paradigm such as MPI or a PGAS language
- Worry only about first-order locality: thread-data affinity
System:
- Understand the system hierarchy and the costs associated with data movement across levels
- Understand the program characteristics
- Derive locality exploitation on a level-by-level basis via hierarchical thread grouping/partitioning
Motivations and Early Investigations
Proper placement will:
- Avoid unnecessary data movement by exploiting locality
- Utilize the shared memory and caches in the neighborhood
- Utilize the best interconnect for the underlying communication
- Yield a rising benefit as the size of the system increases: a must for exascale!
[Figure: "Effect of Exploiting Hierarchical Locality (Read Access)", a synthetic benchmark showing the speedup gained from proper placement as the number of threads varies from 24 to 1008 and remote communication from 10% to 90%; speedups reach the 35-40x band.]
Motivations and Early Investigations (continued)
- The response of each level to communication varies with message size; closer is not always faster
- Know and characterize your architecture!
[Figure: put/write bandwidth (GB/s) on a Cray XE6m versus message size from 8 B to 2 MB, for self, same-die, same-chip, same-node, and remote placements.]
PHLAME Methodology (Parallel Hierarchical Abstraction Model of Execution)
1. Characterize the machine's message costs at each level with communication benchmarks, generating the PHLAME Description File (PDF)
2. Profile the application's communication (the instrumented program produces an application communication profile)
3. Build a placement layout for the threads, based on the above, with the placement algorithm
4. Run the application on the target machine with the layout built in the previous step
Characterizing the Target Machine
- Message cost: the total time for a message to be delivered
- Example machine communication characterization, time per message (ns):

Level \ Msg (bytes) |        8 |       16 |       32 |       64 |      128 |      256 | ...
1                   | 0.516956 | 0.665469 | 1.209482 | 1.986097 | 3.606203 |  7.593014
2                   | 0.688468 | 1.038422 | 1.547030 | 2.772387 | 5.138746 | 10.86957
3                   | 0.687853 | 1.033378 | 1.543448 | 2.770083 | 5.128205 | 10.85776
4                   | 0.706414 | 1.050420 | 1.548707 | 2.778550 | 5.128205 | 11.02536
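A hedged sketch of how the PDF might be held in memory and queried (my illustration; the names and the bin mapping are assumptions, the values are those of the table above):

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical in-memory form of the PHLAME Description File (PDF):
 * per-message cost (ns) indexed by hierarchy level and size bin. */
#define LEVELS    4   /* levels 1-4 of the table */
#define SIZE_BINS 6   /* 8, 16, 32, 64, 128, 256 bytes */

static const double msg_cost_ns[LEVELS][SIZE_BINS] = {
    {0.516956, 0.665469, 1.209482, 1.986097, 3.606203,  7.593014},
    {0.688468, 1.038422, 1.547030, 2.772387, 5.138746, 10.86957},
    {0.687853, 1.033378, 1.543448, 2.770083, 5.128205, 10.85776},
    {0.706414, 1.050420, 1.548707, 2.778550, 5.128205, 11.02536},
};

/* Map a message size in bytes to its bin: 8 -> 0, 16 -> 1, ..., 256 -> 5. */
static size_t size_to_bin(double bytes) {
    size_t bin = (size_t)ceil(log2(bytes > 8 ? bytes : 8)) - 3;
    return bin < SIZE_BINS ? bin : SIZE_BINS - 1;
}

static double lookup_cost(size_t level, double msg_bytes) {
    return msg_cost_ns[level][size_to_bin(msg_bytes)];
}
```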
Characterizing the Application Communication
- Instrument the application code to generate communication activity matrices
- The range of message sizes is partitioned into bins; each bin corresponds to a sub-range, e.g., 1-64, 64-128, ...
- There are two communication activity matrices per bin: average message size and number of messages, each indexed by initiating thread and data-affinity thread
[Figure: pairs of 4x4 activity matrices for the bins Msg<64, 64≤Msg<128, 128≤Msg<256, ...: one matrix of average message sizes and one of message counts per bin.]
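Continuing the sketch above (size_to_bin and SIZE_BINS come from the previous listing; the hook itself is my illustration of what the instrumentation records):

```c
/* Hypothetical profiler hook: record one message from thread 'src'
 * to data with affinity to thread 'dst'. Two matrices per size bin,
 * as in the PHLAME profile: message count and average message size. */
#define NTHREADS 4

static double num_msgs[SIZE_BINS][NTHREADS][NTHREADS];
static double avg_size[SIZE_BINS][NTHREADS][NTHREADS];

static void record_message(int src, int dst, double bytes) {
    size_t b = size_to_bin(bytes);
    double n = num_msgs[b][src][dst];
    /* Running mean keeps avg_size consistent with num_msgs. */
    avg_size[b][src][dst] = (avg_size[b][src][dst] * n + bytes) / (n + 1.0);
    num_msgs[b][src][dst] = n + 1.0;
}
```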
Calculating Level Costs
- Placement decisions require a measure of how well threads fit together
- Repeat for each level l: for each pair of threads (i, j), with i != j, calculate the cost of their communication if the pair were placed at that level. Summing the element-wise (⊙) products of the per-bin activity matrices with the per-level message costs gives

    Cost_l(i, j) = sum over bins b = 1..B of NumMsgs_b(i, j) x MsgCost_l(AvgMsgSize_b(i, j)),

  where B is the number of bins.
[Table: per-level message costs (ns) at message sizes 8-256 bytes, as in the machine characterization above, with levels 1-4 labeled Die, Chip, Node, Remote.]
[Figure: per-bin activity matrices ⊙ message costs, summed into one 4x4 level-cost matrix per level.]
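Putting the pieces together (again reusing the earlier sketches; this composition is my illustration, not the talk's code):

```c
/* Level cost for a thread pair: sum over bins of the number of
 * messages times the per-message cost, at that level, for the bin's
 * average message size. Reuses num_msgs, avg_size, lookup_cost. */
static double level_cost(size_t level, int i, int j) {
    double cost = 0.0;
    for (size_t b = 0; b < SIZE_BINS; b++) {
        double n = num_msgs[b][i][j];
        if (n > 0.0)
            cost += n * lookup_cost(level, avg_size[b][i][j]);
    }
    return cost;
}
```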
Hierarchical Thread Fitness Measure
- The fit measure shows how much two threads benefit or lose if scheduled together at a given level
- The fit measure is based on the difference of message costs between the levels of the hierarchy (CPU, Node, Blade, Chassis): a pair of threads (i, j) gains at a level to the extent that placing it there costs less than placing it one level up
[Figure: the per-level cost matrices for the four levels, from which the pairwise fitness values are computed, frame by frame for each level.]
Mapping to Graph Theory
- The application communication pattern can be mapped onto a graph: vertices represent the threads, and edges represent interactions between threads
- The HTF values at each level serve as edge weights, so each edge (i, j) carries multiple weights: w_ij^1, w_ij^2, w_ij^3, ..., w_ij^L
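A hedged sketch of the multi-weighted edge representation (my illustration; here the per-level weights are populated from the level costs computed earlier, from which the fitness differences are taken):

```c
/* Multi-weighted communication graph: one weight per hierarchy level
 * on each edge, as in w_ij^1 ... w_ij^L. Reuses level_cost() above. */
#define NLEVELS 4   /* e.g., CPU, node, blade, chassis */

typedef struct {
    int i, j;            /* the thread pair this edge connects */
    double w[NLEVELS];   /* per-level weight for the pair      */
} edge_t;

static void make_edge(edge_t *e, int i, int j) {
    e->i = i;
    e->j = j;
    for (int l = 0; l < NLEVELS; l++)
        e->w[l] = level_cost((size_t)l, i, j);
}
```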
Hierarchical Graph Partitioning
Algorithms can be:
- Bottom-up: form partitions at the lower levels first and recursively group them at the higher levels (see the sketch below)
- Top-down: form partitions at the upper levels first and recursively break them down at the lower levels
Abstract machine example: Level 1 width = 4 (number of locales), MaxLocaleSize = 4 (number of cores in each locale)
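As one plausible bottom-up heuristic (my sketch, not the talk's algorithm, reusing edge_t and NTHREADS from above): greedily merge the most heavily communicating pairs into locales of bounded size using union-find, honoring MaxLocaleSize from the abstract machine description:

```c
#include <stdlib.h>

#define MAX_LOCALE_SIZE 4   /* MaxLocaleSize from the abstract machine */

static int parent[NTHREADS], group_size[NTHREADS];

static int find(int x) {
    return parent[x] == x ? x : (parent[x] = find(parent[x]));
}

/* Sort heaviest-first by the lowest-level weight (a simplification). */
static int cmp_edges(const void *a, const void *b) {
    const edge_t *ea = a, *eb = b;
    return (eb->w[0] > ea->w[0]) - (eb->w[0] < ea->w[0]);
}

static void cluster_bottom_up(edge_t *edges, size_t nedges) {
    for (int t = 0; t < NTHREADS; t++) { parent[t] = t; group_size[t] = 1; }
    qsort(edges, nedges, sizeof *edges, cmp_edges);
    for (size_t k = 0; k < nedges; k++) {
        int a = find(edges[k].i), b = find(edges[k].j);
        if (a != b && group_size[a] + group_size[b] <= MAX_LOCALE_SIZE) {
            parent[b] = a;                 /* merge the two groups */
            group_size[a] += group_size[b];
        }
    }
}
```

The resulting groups become locales; repeating the merge with the next level's weights would recursively build the upper levels of the hierarchy.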
Testbed
- Cray XE6m/XK7m: 24 cores per node (two 12-core AMD Magny-Cours); Gemini interconnect, 2D torus
- UPC NPB benchmarks from GWU: IS, FT, CG, MG, and EP, all Class C
- Heat diffusion
Profiling the Application Communication: Implementation
- TAU was selected to profile the UPC and MPI programs; it generates an activity matrix for each bin
- Bins are not supported in TAU profiles, so modifications were made to the TAU backend and frontends to support them
Customizing GASNet
- The clustering algorithm usually assigns unequal numbers of threads to different nodes, and the Cray Application Level Placement Scheduler (ALPS) does not support this
- A modified GASNet Gemini conduit was used to trick the system into a non-uniform thread count per node: dummy processes are launched, and environment variables control how the runtime picks the correct number of processes on each node
- Example from the slide: GASNET_NUM_THREADS = 10, GASNET_THREAD_MAP = "3 8 1 10 2 5 11 12 6 7 0 9"
[Figure: the resulting non-uniform placement of threads 0-9 across nodes.]
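A hedged sketch of the mechanism (the variable names are the slide's; the parsing and reporting logic is my illustration, not the actual conduit code):

```c
#include <stdlib.h>
#include <stdio.h>

/* The modified conduit launches extra "dummy" processes, then consults
 * these variables to decide which ranks survive on each node. */
static int active_thread_count(void) {
    const char *n = getenv("GASNET_NUM_THREADS");
    return n ? atoi(n) : -1;   /* e.g., 10 in the slide's example */
}

int main(void) {
    const char *map = getenv("GASNET_THREAD_MAP"); /* e.g., "3 8 1 10 ..." */
    printf("active threads: %d, map: %s\n", active_thread_count(),
           map ? map : "(unset)");
    return 0;
}
```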
Experimental Results
FT: all-to-all communication
[Figure: gain (%) over the default placement (0-12% scale) and relative communication overhead, versus number of threads (64, 128, 256, 512, 1024), for Clustering, Splitting, Splitting non-restricted, and PHAST.]
Experimental Results (MPI)
FT: all-to-all communication
[Figure: gain (%) over the default placement (0-8% scale) and relative communication overhead, versus number of threads (64-1024), for the same four placement strategies.]
Experimental Results (UPC)
CG: irregular memory access and communication
[Figure: gain (%) over the default placement (0-100% scale) and relative communication overhead, versus number of threads (64-1024), for the same four placement strategies.]
Experimental Results (MPI)
CG: irregular memory access and communication
[Figure: gain (%) over the default placement (0-45% scale) and relative communication overhead, versus number of threads (64-1024), for the same four placement strategies.]
CG: Non-Restricted Explanation
[Figure: thread communication between Node 0 and Node 1, highlighting the remote links.]
Concluding Remarks
- Due to energy and bandwidth constraints, data movement is becoming too expensive; locality exploitation is an obvious target
- Extreme-scale architectures are becoming deeply hierarchical, giving rise to hierarchical locality
- Hierarchical locality exploitation must be done productively, leaving programmers with only the minimum necessary work
- We can expect some programming paradigms to provide explicit solutions
- Locality-aware programming, hardware support, and run-time systems can play a bigger role while keeping programmers productive
Publications
- Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "Exploiting Hierarchical Locality in Deep Parallel Architectures," ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 2, June 2016.
- Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 4, January 2016.
- Ahmad Anbar, Abdel-Hameed Badawy, Olivier Serres, and Tarek El-Ghazawi, "Where Should the Threads Go? Leveraging Hierarchical Data Locality to Solve the Thread Affinity Dilemma," in Proc. 20th International Conference on Parallel and Distributed Systems (ICPADS 2014), IEEE, Hsinchu, Taiwan, Dec 16-19, 2014.
- Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "PHLAME: Hierarchical Locality Exploitation Using the PGAS Model," IEEE International Conference on Partitioned Global Address Space Programming Models (PGAS 2015), Washington DC, September 18-20, 2015.
- Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," in Proc. 16th IEEE International Conference on High Performance Computing and Communications, August 20-22, 2014.
Follow-Up Work in Hierarchical Locality Exploitation
- Use thread-data affinity from a locality-aware program as the starting point for a hierarchical locality exploitation system (PHLAME: Parallel Hierarchical Abstraction Model of Execution)
- Examine the best graph partitioning methods
- Decentralize the algorithms, and build in fast predictions to handle exascale
- Consider dynamic solutions
- Consider unprofiled cases, collecting intelligence on runs for later use and optimization
- Consider data-dependent cases
- Consider dynamic parallelism cases
- Investigate hardware support