TRANSCRIPT
Hierarchical Locality and Parallel Programming in the Extreme Scale Era
Tarek El-Ghazawi, The George Washington University
University of Southern California, September 29, 2016
Overview
- Fundamental Challenges for Extreme Computing
- Locality and Hierarchical Locality
- Programming Models
- Hardware Support for Productive Locality Exploitation: Address Remapping
- Hierarchical Locality Exploitation
- Concluding Remarks
Top Ten Challenges for Exascale: areas where research and advances are needed!
1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design & Decision
9. Resilience and Correctness
10. Scientific Productivity
(DoE ASCAC Subcommittee Report, Feb 2014. Items highlighted in the slide are data-movement and/or programming related.)
Technological Challenges (1): Combined Bandwidth and Energy Challenges for Exascale
- Locality and data movement matter a great deal: cost (energy and time) rises rapidly with distance
- Locality and data movement are critical even at short distances, and more so at far distances
[Figures: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 14]
Technological Challenges (2): Bandwidth
- The interconnect is not keeping up with the growth in compute capability. Many apps require 1 Byte/FLOP off-chip, which is not possible at 10 TFLOPs per chip and beyond.
- Intel Knights Landing: 500 GB/s against roughly 3 TFLOPS of peak compute => about 1/6 Byte/FLOP.
- Huge bandwidth density (GB/s/μm) is needed on-chip due to the large number of cores in a small area.
(Ref: Miller, D. A., Proceedings of the IEEE, 2009.)
Manycore bandwidth requirements keep growing, widening the gap between available I/O and compute capability.
[Figure: Bytes/FLOP versus year (2012-2015) for Xeon Phi (Knights Corner, Knights Landing) and NVIDIA K20/K40/K80, illustrating the declining Bytes/FLOP ratio.]
Architectural Challenges: Architectures Are Becoming Deeply Hierarchical at Extreme Scale - Chips and Systems
[Figures: step-by-step build-up of the hierarchy from cores and dies up through chips, nodes, blades, chassis, and cabinets; examples include the Cray XC40 system and the Tilera TILE64 chip.]
Where Do Programming Models Fit in All That?

What is a programming model?
- An abstract virtual machine
- A view of data and execution
- The logical interface between architecture and applications

Why programming models?
- They decouple applications and architectures: write applications that run effectively across architectures, and design new architectures that can effectively support legacy applications

Programming model design considerations:
- Expose modern architectural features, to exploit machine power and improve performance
- Maintain ease of use
- Together, the two previous points increase productivity!
Current Programming Models and Locality Awareness
- Message passing (e.g., MPI): each process has its own address space; locality-aware; two-sided communication.
- Shared memory (e.g., OpenMP): one shared address space; not locality-aware; one-sided communication.
- Partitioned Global Address Space (PGAS): a global address space, logically partitioned across threads; locality-aware; one-sided communication; examples include UPC and Chapel.
PGAS languages include UPC, Chapel, and X10.
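As a minimal illustration (my sketch, not from the talk) of how a PGAS language such as UPC exposes locality to the programmer, the fragment below declares a blocked shared array (the same arrayA used later in the talk) and iterates with an affinity expression so that each thread touches only the elements it owns:

```c
/* Minimal UPC sketch (illustrative, not from the talk). */
#include <upc.h>
#include <stdio.h>

shared [4] int arrayA[32];   /* blocks of 4 elements, dealt round-robin over threads */

int main(void) {
    upc_forall (int i = 0; i < 32; i++; &arrayA[i]) {
        /* The body runs on the thread with affinity to arrayA[i],
         * so every access here is a local shared access. */
        arrayA[i] = MYTHREAD;
    }
    upc_barrier;
    if (MYTHREAD == 0)
        printf("arrayA[10] has affinity to thread %d\n",
               (int)upc_threadof(&arrayA[10]));
    return 0;
}
```

The affinity field in upc_forall is exactly the kind of locality test the talk later proposes to support in hardware.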
Memory Accesses in UPC: Shared Address Translation Overheads
- Measured the shared address space overheads with a set of micro-benchmarks, each measuring a different component separately: network time, address translation, address incrementation, and memory access.
[Figure: time per access (ns) by type of access, with the components stacked; values span 1.42 ns for private accesses, 4.2-4.53 ns for local shared accesses, and 1736.8 ns for remote shared accesses, with effective rates of 5.25 GB/s, 734 MB/s, and 4.25 MB/s respectively. A companion panel shows the percentage of time spent in actual memory access for each access type.]
[Figure: the PGAS memory model, with a shared space partitioned across threads 0 to THREADS-1 above per-thread private spaces.]
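To make the translation cost concrete, here is a hedged sketch (my illustration; the struct and function names are hypothetical) of the arithmetic a UPC runtime typically performs to map a linear shared-array index onto a (thread, phase, offset) triple before the access can be issued:

```c
#include <stddef.h>

/* Hypothetical shared-pointer components for 'shared [B] T A[N]'
 * over T threads; field names are illustrative. */
typedef struct {
    size_t thread; /* which thread owns the element       */
    size_t phase;  /* position within the current block   */
    size_t offset; /* byte offset into that thread's part */
} shared_ptr_t;

/* Translate linear index i for block size B over T threads. */
static shared_ptr_t translate(size_t i, size_t B, size_t T, size_t elem_size) {
    shared_ptr_t p;
    size_t block = i / B;            /* global block number          */
    p.thread = block % T;            /* blocks are dealt round-robin */
    p.phase  = i % B;                /* index inside the block       */
    size_t local_block = block / T;  /* blocks already local to owner */
    p.offset = (local_block * B + p.phase) * elem_size;
    return p;
}
```

Every shared access pays for these divisions and modulos; the micro-benchmarks above separate this translation cost from incrementation, network, and raw memory-access time.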
Memory Access Costs in Chapel
Tested shared address access costs in Chapel, using Chapel syntax to test:
- The local part of a distributed object, un-optimized (accessing local data without saying "local")
- Local, optimized (the local part hand-optimized by saying "local")
- Local and non-distributed
Findings:
- Compiler optimization -> 2x faster; compiler plus hand optimization -> 70x faster
- Compiler optimization affects remote accesses as well
- Both UPC and Chapel require "unproductive!" hand tuning to improve local shared accesses
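In UPC, that hand tuning usually means casting a pointer-to-shared to an ordinary C pointer for data whose affinity is known to be local, so later accesses skip shared-address translation entirely. A minimal sketch (my example, assuming the blocked array from earlier, where thread MYTHREAD's first block starts at index MYTHREAD * 4):

```c
#include <upc.h>

shared [4] int arrayA[32];

void scale_local_block(int factor) {
    /* Hand tuning: cast the first block with affinity to this thread
     * to a private pointer. Valid in UPC only because these elements
     * are local to MYTHREAD. Fast, but unproductive and error-prone. */
    int *localA = (int *)&arrayA[MYTHREAD * 4];
    for (int k = 0; k < 4; k++)
        localA[k] *= factor;
}
```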
Fast Address Translation for PGAS
Software solutions:
- Hand tweaking - non-productive
- Compiler optimizations - reduced arithmetic for some straightforward cases
- Look-up tables, full and reduced - they take memory! [ICPP05]
- TLBs, ...
Hardware solutions:
- Create hardware that understands how to traverse the PGAS memory model and supports the basic costly needs
- Avail it through instructions and leverage them in the compiler
- Some work exists for UPC, little for Chapel

Hardware Support for PGAS: Example Operations to Support in Hardware
- Shared address incrementation
- Load/store to/from a PGAS shared address
- Address translation support: convert a shared address to the system virtual address used to perform the access
- Locality tests for remote data: can be used to tell whether to call the network subroutines, e.g., by testing the affinity field in a work-sharing construct
- Availed as ISA extensions: new instructions used directly by the compiler
- Current hardware support and instructions only cover address mapping; future support for remote data accesses and various types of synchronization is of interest
Hardware/Software Co-Design Platform in a Nutshell
- First prototype is in FPGAs and supports small core counts and apps: Leon3 cores on a Virtex-6 FPGA, extended with the proposed PGAS hardware support for shared addressing; BUPC and GASNet ported on top of Leon3, with benchmarking kernels.
- Second is primarily software and supports bigger core counts and codes: BUPC, GASNet, and the benchmarking kernels ported on top of Gem5; the new instructions are inserted into code generation, with a runtime system that recognizes and enforces the developed mapping, so UPC code runs out of the box. Runs on a workstation today, with a cluster planned for the future.
PGAS Hardware Support Overview

shared [4] int arrayA[32];
arrayA[10] = 5;

[Figure: arrayA distributed in blocks of four across threads 0-3: thread 0 holds elements 0-3 and 16-19, thread 1 holds 4-7 and 20-23, thread 2 holds 8-11 and 24-27, thread 3 holds 12-15 and 28-31.]
- Address incrementation (pgas_inc_{x}): advances a shared pointer, e.g., from (Th=0, Ph=0, Va=0x3f10) to (Th=2, Ph=2, Va=0x3f18) for arrayA[10].
- Address translation and store (pgas_st_{x}): converts the shared pointer representation into a regular pointer representation (e.g., 0xfff01203f14) and performs the store.
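For contrast, here is a hedged sketch of the software pointer-increment logic that a pgas_inc instruction replaces (my reconstruction of the standard UPC blocked-layout arithmetic; names are illustrative). It reproduces the Th/Ph/Va transitions above for B = 4 and T = 4:

```c
#include <stddef.h>
#include <stdint.h>

/* Shared-pointer components as in the slide: thread, phase, and the
 * virtual address within the owning thread's partition. */
typedef struct { unsigned th, ph; uintptr_t va; } spointer_t;

/* Advance a shared pointer by one element of size elem_size for a
 * 'shared [B]' layout over T threads (software version of pgas_inc). */
static void pgas_inc_sw(spointer_t *p, unsigned B, unsigned T,
                        size_t elem_size) {
    p->ph++;
    p->va += elem_size;
    if (p->ph == B) {                 /* crossed a block boundary */
        p->ph = 0;
        p->th++;                      /* next block lives on the next thread */
        p->va -= B * elem_size;       /* same local block on that thread */
        if (p->th == T) {             /* wrapped around all threads */
            p->th = 0;
            p->va += B * elem_size;   /* move on to the next local block */
        }
    }
}
```

Starting from (Th=0, Ph=0, Va=0x3f10), ten applications of this routine yield (Th=2, Ph=2, Va=0x3f18), matching the slide. This branchy arithmetic on every pointer bump is what the ISA extension removes.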
Early Results: NPB Kernels with Hardware Support (Gem5, Alpha 21264)
Possible Solutions for Hierarchical Locality Exploitation
- Rewrite your code with low-level tricks to target the underlying hierarchical architecture? Great performance, but not productive and non-portable.
- Extend programming models with hierarchical syntax and semantics and ask programmers to worry about all of those hardware details (make them hierarchical-locality-aware)? Portable, but not productive.
Productive Division of Responsibilities: The Programmer and the System
Programmer:
- Use a locality-aware programming paradigm such as MPI or a PGAS language
- Worry only about first-order locality: thread-data affinity
System:
- Understand the system hierarchy and the costs associated with data movement across levels
- Understand the program characteristics
- Derive locality exploitation on a level-by-level basis via hierarchical thread grouping/partitioning
Motivations and Early Investigations
Proper placement will:
- Avoid unnecessary data movement by exploiting locality
- Utilize the shared memory and caches in the neighborhood
- Utilize the best interconnect for the underlying communication
- Yield a rising benefit as the size of the system increases: a must for exascale!
[Figure: "Effect of Exploiting Hierarchical Locality (Read Access)", a synthetic benchmark showing the speedup gained from proper placement as the number of threads varies from 24 to 1008 and remote communication from 10% to 90%; speedups reach the 35-40x band.]
Motivations and Early Investigations (continued)
- The response of each level to communication varies with message size; closer is not always faster
- Know and characterize your architecture!
[Figure: put/write bandwidth (GB/s) on a Cray XE6m versus message size from 8 B to 2 MB, for self, same-die, same-chip, same-node, and remote placements.]
PHLAME Methodology (Parallel Hierarchical Abstraction Model of Execution)
1. Characterize the machine's message costs at each level with communication benchmarks, generating the PHLAME Description File (PDF)
2. Profile the application's communication (the instrumented program produces an application communication profile)
3. Build a placement layout for the threads, based on the above, with the placement algorithm
4. Run the application on the target machine with the layout built in the previous step
Characterizing the Target Machine
- Message cost: the total time for a message to be delivered
- Example machine communication characterization, time per message (ns):

Level \ Msg (bytes) |        8 |       16 |       32 |       64 |      128 |      256 | ...
1                   | 0.516956 | 0.665469 | 1.209482 | 1.986097 | 3.606203 |  7.593014
2                   | 0.688468 | 1.038422 | 1.547030 | 2.772387 | 5.138746 | 10.86957
3                   | 0.687853 | 1.033378 | 1.543448 | 2.770083 | 5.128205 | 10.85776
4                   | 0.706414 | 1.050420 | 1.548707 | 2.778550 | 5.128205 | 11.02536
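A hedged sketch of how the PDF might be held in memory and queried (my illustration; the names and the bin mapping are assumptions, the values are those of the table above):

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical in-memory form of the PHLAME Description File (PDF):
 * per-message cost (ns) indexed by hierarchy level and size bin. */
#define LEVELS    4   /* levels 1-4 of the table */
#define SIZE_BINS 6   /* 8, 16, 32, 64, 128, 256 bytes */

static const double msg_cost_ns[LEVELS][SIZE_BINS] = {
    {0.516956, 0.665469, 1.209482, 1.986097, 3.606203,  7.593014},
    {0.688468, 1.038422, 1.547030, 2.772387, 5.138746, 10.86957},
    {0.687853, 1.033378, 1.543448, 2.770083, 5.128205, 10.85776},
    {0.706414, 1.050420, 1.548707, 2.778550, 5.128205, 11.02536},
};

/* Map a message size in bytes to its bin: 8 -> 0, 16 -> 1, ..., 256 -> 5. */
static size_t size_to_bin(double bytes) {
    size_t bin = (size_t)ceil(log2(bytes > 8 ? bytes : 8)) - 3;
    return bin < SIZE_BINS ? bin : SIZE_BINS - 1;
}

static double lookup_cost(size_t level, double msg_bytes) {
    return msg_cost_ns[level][size_to_bin(msg_bytes)];
}
```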
Characterizing the Application Communication
- Instrument the application code to generate communication activity matrices
- The range of message sizes is partitioned into bins; each bin corresponds to a sub-range, e.g., 1-64, 64-128, ...
- There are two communication activity matrices per bin: average message size and number of messages, each indexed by initiating thread and data-affinity thread
[Figure: pairs of 4x4 activity matrices for the bins Msg<64, 64≤Msg<128, 128≤Msg<256, ...: one matrix of average message sizes and one of message counts per bin.]
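Continuing the sketch above (size_to_bin and SIZE_BINS come from the previous listing; the hook itself is my illustration of what the instrumentation records):

```c
/* Hypothetical profiler hook: record one message from thread 'src'
 * to data with affinity to thread 'dst'. Two matrices per size bin,
 * as in the PHLAME profile: message count and average message size. */
#define NTHREADS 4

static double num_msgs[SIZE_BINS][NTHREADS][NTHREADS];
static double avg_size[SIZE_BINS][NTHREADS][NTHREADS];

static void record_message(int src, int dst, double bytes) {
    size_t b = size_to_bin(bytes);
    double n = num_msgs[b][src][dst];
    /* Running mean keeps avg_size consistent with num_msgs. */
    avg_size[b][src][dst] = (avg_size[b][src][dst] * n + bytes) / (n + 1.0);
    num_msgs[b][src][dst] = n + 1.0;
}
```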
Calculating Level Costs
- Placement decisions require a measure of how well threads fit together
- Repeat for each level l: for each pair of threads (i, j), with i != j, calculate the cost of their communication if the pair were placed at that level. Summing the element-wise (⊙) products of the per-bin activity matrices with the per-level message costs gives

    Cost_l(i, j) = sum over bins b = 1..B of NumMsgs_b(i, j) x MsgCost_l(AvgMsgSize_b(i, j)),

  where B is the number of bins.
[Table: per-level message costs (ns) at message sizes 8-256 bytes, as in the machine characterization above, with levels 1-4 labeled Die, Chip, Node, Remote.]
[Figure: per-bin activity matrices ⊙ message costs, summed into one 4x4 level-cost matrix per level.]
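Putting the pieces together (again reusing the earlier sketches; this composition is my illustration, not the talk's code):

```c
/* Level cost for a thread pair: sum over bins of the number of
 * messages times the per-message cost, at that level, for the bin's
 * average message size. Reuses num_msgs, avg_size, lookup_cost. */
static double level_cost(size_t level, int i, int j) {
    double cost = 0.0;
    for (size_t b = 0; b < SIZE_BINS; b++) {
        double n = num_msgs[b][i][j];
        if (n > 0.0)
            cost += n * lookup_cost(level, avg_size[b][i][j]);
    }
    return cost;
}
```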
Hierarchical Thread Fitness Measure
- The fit measure shows how much two threads benefit or lose if scheduled together at a given level
- The fit measure is based on the difference of message costs between the levels of the hierarchy (CPU, Node, Blade, Chassis): a pair of threads (i, j) gains at a level to the extent that placing it there costs less than placing it one level up
[Figure: the per-level cost matrices for the four levels, from which the pairwise fitness values are computed, frame by frame for each level.]
Mapping to Graph Theory
- The application communication pattern can be mapped onto a graph: vertices represent the threads, and edges represent interactions between threads
- The HTF values at each level serve as edge weights, so each edge (i, j) carries multiple weights: w_ij^1, w_ij^2, w_ij^3, ..., w_ij^L
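A hedged sketch of the multi-weighted edge representation (my illustration; here the per-level weights are populated from the level costs computed earlier, from which the fitness differences are taken):

```c
/* Multi-weighted communication graph: one weight per hierarchy level
 * on each edge, as in w_ij^1 ... w_ij^L. Reuses level_cost() above. */
#define NLEVELS 4   /* e.g., CPU, node, blade, chassis */

typedef struct {
    int i, j;            /* the thread pair this edge connects */
    double w[NLEVELS];   /* per-level weight for the pair      */
} edge_t;

static void make_edge(edge_t *e, int i, int j) {
    e->i = i;
    e->j = j;
    for (int l = 0; l < NLEVELS; l++)
        e->w[l] = level_cost((size_t)l, i, j);
}
```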
Hierarchical Graph Partitioning
Algorithms can be:
- Bottom-up: form partitions at the lower levels first and recursively group them at the higher levels (see the sketch below)
- Top-down: form partitions at the upper levels first and recursively break them down at the lower levels
Abstract machine example: Level 1 width = 4 (number of locales), MaxLocaleSize = 4 (number of cores in each locale)
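As one plausible bottom-up heuristic (my sketch, not the talk's algorithm, reusing edge_t and NTHREADS from above): greedily merge the most heavily communicating pairs into locales of bounded size using union-find, honoring MaxLocaleSize from the abstract machine description:

```c
#include <stdlib.h>

#define MAX_LOCALE_SIZE 4   /* MaxLocaleSize from the abstract machine */

static int parent[NTHREADS], group_size[NTHREADS];

static int find(int x) {
    return parent[x] == x ? x : (parent[x] = find(parent[x]));
}

/* Sort heaviest-first by the lowest-level weight (a simplification). */
static int cmp_edges(const void *a, const void *b) {
    const edge_t *ea = a, *eb = b;
    return (eb->w[0] > ea->w[0]) - (eb->w[0] < ea->w[0]);
}

static void cluster_bottom_up(edge_t *edges, size_t nedges) {
    for (int t = 0; t < NTHREADS; t++) { parent[t] = t; group_size[t] = 1; }
    qsort(edges, nedges, sizeof *edges, cmp_edges);
    for (size_t k = 0; k < nedges; k++) {
        int a = find(edges[k].i), b = find(edges[k].j);
        if (a != b && group_size[a] + group_size[b] <= MAX_LOCALE_SIZE) {
            parent[b] = a;                 /* merge the two groups */
            group_size[a] += group_size[b];
        }
    }
}
```

The resulting groups become locales; repeating the merge with the next level's weights would recursively build the upper levels of the hierarchy.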
Testbed
- Cray XE6m/XK7m: 24 cores per node (two 12-core AMD Magny-Cours); Gemini interconnect, 2D torus
- UPC NPB benchmarks from GWU: IS, FT, CG, MG, and EP, all Class C
- Heat diffusion
Profiling the Application Communication: Implementation
- TAU was selected to profile the UPC and MPI programs; it generates an activity matrix for each bin
- Bins are not supported in TAU profiles, so modifications were made to the TAU backend and frontends to support them
Customizing GASNet
- The clustering algorithm usually assigns unequal numbers of threads to different nodes, and the Cray Application Level Placement Scheduler (ALPS) does not support this
- A modified GASNet Gemini conduit was used to trick the system into a non-uniform thread count per node: dummy processes are launched, and environment variables control how the runtime picks the correct number of processes on each node
- Example from the slide: GASNET_NUM_THREADS = 10, GASNET_THREAD_MAP = "3 8 1 10 2 5 11 12 6 7 0 9"
[Figure: the resulting non-uniform placement of threads 0-9 across nodes.]
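A hedged sketch of the mechanism (the variable names are the slide's; the parsing and reporting logic is my illustration, not the actual conduit code):

```c
#include <stdlib.h>
#include <stdio.h>

/* The modified conduit launches extra "dummy" processes, then consults
 * these variables to decide which ranks survive on each node. */
static int active_thread_count(void) {
    const char *n = getenv("GASNET_NUM_THREADS");
    return n ? atoi(n) : -1;   /* e.g., 10 in the slide's example */
}

int main(void) {
    const char *map = getenv("GASNET_THREAD_MAP"); /* e.g., "3 8 1 10 ..." */
    printf("active threads: %d, map: %s\n", active_thread_count(),
           map ? map : "(unset)");
    return 0;
}
```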
Experimental Results
FT: all-to-all communication
[Figure: gain (%) over the default placement (0-12% scale) and relative communication overhead, versus number of threads (64, 128, 256, 512, 1024), for Clustering, Splitting, Splitting non-restricted, and PHAST.]
Experimental Results (MPI)
FT: all-to-all communication
[Figure: gain (%) over the default placement (0-8% scale) and relative communication overhead, versus number of threads (64-1024), for the same four placement strategies.]
Experimental Results (UPC)
CG: irregular memory access and communication
[Figure: gain (%) over the default placement (0-100% scale) and relative communication overhead, versus number of threads (64-1024), for the same four placement strategies.]
Experimental Results (MPI)
CG: irregular memory access and communication
[Figure: gain (%) over the default placement (0-45% scale) and relative communication overhead, versus number of threads (64-1024), for the same four placement strategies.]
CG: Non-Restricted Explanation
[Figure: thread communication between Node 0 and Node 1, highlighting the remote links.]
Concluding Remarks
- Due to energy and bandwidth constraints, data movement is becoming too expensive; locality exploitation is an obvious target
- Extreme-scale architectures are becoming deeply hierarchical, giving rise to hierarchical locality
- Hierarchical locality exploitation must be done productively, leaving programmers with only the minimum necessary work
- We can expect some programming paradigms to provide explicit solutions
- Locality-aware programming, hardware support, and run-time systems can play a bigger role while keeping programmers productive
Publications
- Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "Exploiting Hierarchical Locality in Deep Parallel Architectures," ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 2, June 2016.
- Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 4, January 2016.
- Ahmad Anbar, Abdel-Hameed Badawy, Olivier Serres, and Tarek El-Ghazawi, "Where Should the Threads Go? Leveraging Hierarchical Data Locality to Solve the Thread Affinity Dilemma," in Proc. 20th International Conference on Parallel and Distributed Systems (ICPADS 2014), IEEE, Hsinchu, Taiwan, Dec 16-19, 2014.
- Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi, "PHLAME: Hierarchical Locality Exploitation Using the PGAS Model," IEEE International Conference on Partitioned Global Address Space Programming Models (PGAS 2015), Washington DC, September 18-20, 2015.
- Olivier Serres, Abdullah Kayi, Ahmad Anbar, and Tarek El-Ghazawi, "Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study," in Proc. 16th IEEE International Conference on High Performance Computing and Communications, August 20-22, 2014.
Follow-Up Work in Hierarchical Locality Exploitation
- Use thread-data affinity from a locality-aware program as the starting point for a hierarchical locality exploitation system (PHLAME: Parallel Hierarchical Abstraction Model of Execution)
- Examine the best graph partitioning methods
- Decentralize the algorithms, and build in fast predictions to handle exascale
- Consider dynamic solutions
- Consider unprofiled cases, collecting intelligence on runs for later use and optimization
- Consider data-dependent cases
- Consider dynamic parallelism cases
- Investigate hardware support