Current Trends in HPC
§ Tremendous increase in system and job sizes
§ Dense many-core systems becoming popular
§ Less memory available per process
§ Fast and scalable job startup is essential
Importance of Fast Job Startup
§ Development and debugging
§ Regression/Acceptance testing
§ Checkpoint restart
Performance Bottlenecks

Static Connection Setup
§ Setting up O(num_procs²) connections is expensive
§ OpenSHMEM, UPC, and other PGAS libraries lack on-demand connection management

Network Address Exchange over PMI
§ Limited scalability, no potential for overlap
§ Not optimized for symmetric exchange

Global Barriers
§ Unnecessary synchronization and connection setup
Memory Scalability Issues
§ Each node requires O(number of processes × processes per node) memory for storing remote endpoint information
Proposed Solutions

On-demand Connection Management
§ Exchange information and establish connections only when two peers are trying to communicate [1]

PMIX_Ring Extension
§ Move the bulk of the data exchange to high-performance networks [2]

Non-blocking PMI Collectives
§ Overlap the PMI exchange with other tasks [3]

Shared-memory based PMI Get/Allgather
§ All clients access data directly from the launcher daemon through shared memory regions [4]
Summary
§ Near-constant MPI and OpenSHMEM initialization time at any process count
§ 10x and 30x improvement in startup time of MPI and OpenSHMEM respectively at 16,384 processes
§ Reduced memory consumption by O(ppn)
§ 1 GB of memory saved at 1M processes and 16 ppn
References
[1] On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. Chakraborty et al., HIPS '15.
[2] PMI Extensions for Scalable MPI Startup. Chakraborty et al., EuroMPI/Asia '14.
[3] Non-blocking PMI Extensions for Fast MPI Startup. Chakraborty et al., CCGrid '15.
[4] SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. Chakraborty et al., CCGrid '16.
More Information
§ Available in latest MVAPICH2 and MVAPICH2-X
§ http://mvapich.cse.ohio-state.edu/downloads/
§ https://go.osu.edu/mvapich-startup
Job Startup at Exascale: Challenges and Solutions
Results – Non-blocking PMI [3]
§ Near-constant MPI_Init time at any scale
§ MPI_Init with Iallgather performs 288% better than the default, Fence-based exchange
§ Blocking Allgather is 38% faster than blocking Fence
[Figure: Performance of MPI_Init — time taken (seconds) vs. number of processes (64–16K) for Fence, Ifence, Allgather, and Iallgather]
[Figure: Performance comparison of Fence and Allgather — time taken (seconds) vs. number of processes (64–16K) for PMI2_KVS_Fence and PMIX_Allgather]
PMIX_KVS_Ifence
§ Non-blocking version of PMI2_KVS_Fence
§ int PMIX_KVS_Ifence(PMIX_Request *request)

PMIX_Iallgather
§ Optimized for symmetric data movement
§ Reduces data movement by up to 30% (286 KB → 208 KB with 8,192 processes)
§ int PMIX_Iallgather(const char value[], char buffer[], PMIX_Request *request)

PMIX_Wait
§ Wait for the specified request to be completed
§ int PMIX_Wait(PMIX_Request request)
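The intended usage pattern is start, overlap, wait. The sketch below only illustrates that call pattern with trivial mock stand-ins: the real PMIX_KVS_Ifence/PMIX_Wait live in the MVAPICH2 launcher runtime, and here the mock request is passed by pointer and "completes" instantly. The setup_shared_memory step is a hypothetical placeholder for work that could overlap with the exchange.

```c
#include <assert.h>

/* Mock stand-ins for the proposed non-blocking PMI extensions. These
 * are NOT the real implementations; they only model the start/wait
 * protocol so the overlap pattern can be shown end to end. */
typedef struct { int complete; } PMIX_Request;

int PMIX_KVS_Ifence(PMIX_Request *req) {
    req->complete = 0;               /* exchange started, not yet finished */
    return 0;
}

int PMIX_Wait(PMIX_Request *req) {
    req->complete = 1;               /* block until the exchange finishes */
    return 0;
}

/* Placeholder for initialization work that can overlap with the PMI
 * exchange, e.g. shared-memory setup or memory registration. */
int setup_shared_memory(void) { return 42; }

int overlapped_init(void) {
    PMIX_Request req;
    PMIX_KVS_Ifence(&req);           /* start the address exchange */
    int shm = setup_shared_memory(); /* useful work while it progresses */
    PMIX_Wait(&req);                 /* complete before the keys are used */
    assert(req.complete);
    return shm;
}
```

The point of the pattern is that the main thread never idles inside the collective: the launcher daemons progress the exchange while MPI_Init performs unrelated setup.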
Non-blocking PMI Collectives
§ PMI operations are progressed by separate processes that handle process management
§ The MPI library is not involved in progressing PMI communication
§ Similar to Functional Partitioning approaches
§ Can be overlapped with other initialization tasks
PMIX_Request
§ Non-blocking collectives return before the operation is completed
§ They return an opaque handle to the request object that can be used to check for completion
Results – On-demand Connection [1]
§ 29.6 times faster initialization time
§ Hello World performs 8.31 times better
§ Execution time of NAS benchmarks improved by up to 35% with 256 processes and class B data
On-demand Connection Establishment

Static Connection Setup
§ Setting up connections takes over 85% of the total startup time with 4,096 processes
§ RDMA operations require exchanging information about memory segments registered with the HCA
Results – PMI Ring Extension [2]
§ MPI_Init based on PMIX_Ring performs 34% better than the default PMI2_KVS_Fence
§ Hello World runs 33% faster with 8K processes
§ Up to 20% improvement in total execution time of NAS parallel benchmarks
[Figure: Performance of MPI_Init and Hello World with PMIX_Ring — time taken (seconds) vs. number of processes (16–8K) for Hello World (Fence), Hello World (Ring), MPI_Init (Fence), and MPI_Init (proposed)]
[Figure: NAS benchmarks with 1K processes, class B data — time taken (seconds) for EP, MG, CG, FT, BT, and SP with Fence vs. Ring]
New Collective – PMIX_Ring
§ A ring can be formed by exchanging data with only the left and the right neighbors
§ Once the ring is formed, data can be exchanged over high-speed networks like InfiniBand
§ int PMIX_Ring(char value[], char left[], char right[], …)
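To see why neighbor-only exchange suffices, the toy model below builds a full allgather out of N−1 ring steps: PMI only has to deliver each rank its neighbors' data, and everything else can travel around the ring over the fast network. Here the "network" is just an in-memory array of per-rank buffers, and all names are ours, not the PMIX_Ring implementation.

```c
#include <assert.h>

#define N 8  /* number of ranks in this toy run */

/* Each rank starts with only its own value; after N-1 steps of
 * receiving from the left neighbor, every rank holds all N values. */
void ring_allgather(const int value[N], int gathered[N][N]) {
    for (int r = 0; r < N; r++)
        gathered[r][r] = value[r];          /* own contribution */
    for (int step = 1; step < N; step++) {
        for (int r = 0; r < N; r++) {
            int left = (r - 1 + N) % N;     /* receive from left neighbor */
            int src  = (r - step + N) % N;  /* whose value arrives this step */
            gathered[r][src] = gathered[left][src];
        }
    }
}
```

At step s, the value originating at rank (r − s) mod N reaches rank r, so the whole exchange needs only the left/right addresses that PMIX_Ring provides.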
[Figure: PMI Fence with different numbers of Puts — time taken (seconds) vs. number of processes (16–16K) for 100% Put + Fence, 50% Put + Fence, and Fence only]
[Figure: Time taken by PMIX_Ring — time (milliseconds) vs. number of processes (16–16K)]
Shortcomings of the Current PMI Design
§ Puts and Gets are local operations
§ Fence consumes most of the time
§ Time taken by Fence grows approximately linearly with the amount of data transferred (number of keys)
[Figure: Breakdown of MPI_Init — time taken (seconds) vs. number of processes (32–8K), split into PMI Exchanges, Shared Memory, and Other]
[Figure: Time spent in Put, Fence, and Get — time taken (seconds) vs. number of processes (16–16K)]
[Sequence: On-demand connection establishment between Process 1 and Process 2, each with a main thread and a connection-manager thread]
§ P1's main thread issues Put/Get(P2); the send is enqueued while the connection-manager thread creates a QP and moves it to Init
§ P1 sends a Connect Request to P2 carrying (LID, QPN) and (address, size, rkey)
§ P2's connection-manager thread creates its QP, moves it Init → RTR, and returns a Connect Reply with its own (LID, QPN) and (address, size, rkey)
§ On the reply, P1 moves its QP through RTR to RTS; the connection is established and the queued send is dequeued
§ P2 moves its QP to RTS; the connection is established on both sides and the original Put/Get(P2) proceeds
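The handshake above can be viewed as a small state machine driving each InfiniBand queue pair through Init → RTR → RTS. The sketch below is our simplified model, not MVAPICH2 code: message passing is plain structs, and the responder collapses its transitions into one step.

```c
#include <assert.h>

/* Toy model of the on-demand handshake. */
typedef enum { QP_RESET, QP_INIT, QP_RTR, QP_RTS } qp_state;

typedef struct { int lid, qpn; } conn_info;  /* the (LID, QPN) in each message */

typedef struct { qp_state state; conn_info local, remote; } endpoint;

/* Requester: create the QP, move it to Init, emit a Connect Request. */
conn_info connect_request(endpoint *p) {
    p->state = QP_INIT;
    return p->local;
}

/* Responder: on a Connect Request, create the QP, move it through
 * Init and RTR to RTS, and return a Connect Reply. */
conn_info connect_accept(endpoint *p, conn_info req) {
    p->remote = req;
    p->state = QP_RTS;
    return p->local;
}

/* Requester: on the Connect Reply, move through RTR to RTS; the
 * queued send can now be dequeued. */
void connect_complete(endpoint *p, conn_info reply) {
    p->remote = reply;
    p->state = QP_RTS;
}
```

The key property is that no state is created for a peer until the first Put/Get targets it, which is where the O(num_procs²) static setup cost disappears.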
[Figure: Execution time of OpenSHMEM NAS Parallel Benchmarks — execution time (seconds) for BT, EP, MG, and SP with static vs. on-demand connections]
[Figure: Performance of OpenSHMEM initialization and Hello World — time taken (seconds) vs. number of processes (16–8K) for Hello World and start_pes, static vs. on-demand]
[Figure: Breakdown of OpenSHMEM initialization — time taken (seconds) vs. number of processes (32–4K), split into Connection Setup, PMI Exchange, Memory Registration, Shared Memory Setup, and Other]
Application | Processes | Average Peers
BT          | 64        | 8.7
BT          | 1024      | 10.6
EP          | 64        | 3.0
EP          | 1024      | 5.0
MG          | 64        | 9.5
MG          | 1024      | 11.9
SP          | 64        | 8.8
SP          | 1024      | 10.7
2D Heat     | 64        | 5.3
2D Heat     | 1024      | 5.4
Results – Shared Memory based PMI [4]
§ PMI Get takes 0.25 ms with 32 ppn
§ 1,000x reduction in PMI Get latency compared to the default socket-based protocol
§ Memory footprint reduced by O(processes per node) ≈ 1 GB at 1M processes and 16 ppn
§ Backward compatible, negligible overhead
Shared Memory (shmem) based PMI
§ Open a shared memory channel between the server and the clients
§ A hash table is suitable for Fence, while Allgather only requires an array of values
§ Use a hash table built on two shmem regions for efficient insertion and merge, and for compactness
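One way to realize the two-region design (our sketch, not the SHMEMPMI source): a fixed "Table" region of bucket heads plus an append-only "KVS" region of entries, linked by array indices rather than pointers so the structure stays valid at whatever address each client maps the shared segment.

```c
#include <assert.h>
#include <string.h>

#define NBUCKETS 64
#define NENTRIES 256
#define KEYLEN   32
#define VALLEN   32

typedef struct { char key[KEYLEN]; char val[VALLEN]; int next; } kvs_entry;

typedef struct {
    int head[NBUCKETS];       /* Table region: bucket -> first entry index */
    kvs_entry kvs[NENTRIES];  /* KVS region: append-only entries */
    int top;                  /* next free slot in the KVS region */
} shmem_kvs;

unsigned kvs_hash(const char *k) {
    unsigned h = 5381;
    while (*k) h = h * 33 + (unsigned char)*k++;
    return h % NBUCKETS;
}

void kvs_init(shmem_kvs *s) {
    for (int b = 0; b < NBUCKETS; b++) s->head[b] = -1;
    s->top = 0;
}

int kvs_put(shmem_kvs *s, const char *key, const char *val) {
    if (s->top == NENTRIES) return -1;       /* region full */
    kvs_entry *e = &s->kvs[s->top];
    strncpy(e->key, key, KEYLEN - 1); e->key[KEYLEN - 1] = '\0';
    strncpy(e->val, val, VALLEN - 1); e->val[VALLEN - 1] = '\0';
    unsigned b = kvs_hash(key);
    e->next = s->head[b];                    /* push onto the bucket chain */
    s->head[b] = s->top;
    return s->top++;
}

const char *kvs_get(const shmem_kvs *s, const char *key) {
    for (int i = s->head[kvs_hash(key)]; i != -1; i = s->kvs[i].next)
        if (strcmp(s->kvs[i].key, key) == 0) return s->kvs[i].val;
    return NULL;
}
```

In the real design the shmem_kvs struct would live in a region the launcher daemon creates and every client maps read-only, so a Get is a local lookup instead of a socket round trip.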
Memory Scalability in PMI
§ PMI communication between the server and the clients is based on local sockets
§ Latency is high with a large number of clients
§ Copying data into each client's memory causes large memory overhead
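The O(ppn) saving quoted on the poster follows from eliminating the per-client copies of the key-value store: with sockets every client on a node holds its own copy, with shmem one copy per node is shared. The arithmetic below reproduces the ~1 GB figure under our assumption of roughly 64 bytes of endpoint information per process.

```c
#include <assert.h>

/* Extra bytes per node eliminated by sharing one KVS copy instead of
 * keeping one copy per client process on the node. entry_bytes (the
 * per-process endpoint record size) is an assumed parameter. */
long long saved_bytes(long long nprocs, int ppn, int entry_bytes) {
    long long kvs_copy = nprocs * entry_bytes;  /* one full copy of the KVS */
    return (ppn - 1) * kvs_copy;                /* redundant copies removed */
}
```

With 1M processes, 16 ppn, and 64-byte entries this gives 15 × 64 MB ≈ 1 GB per node, matching the poster's headline number.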
Sourav Chakraborty and Dhabaleswar K. Panda (Advisor), The Ohio State University
[Figure: Time taken by one PMI_Get — time (milliseconds) vs. number of processes (1–32)]
[Figure: PMI memory usage — memory usage per node (MB, log scale) vs. number of processes, for PPN = 16, 32, and 64]
[Figure: Time taken by one PMI_Get — time (milliseconds) vs. number of processes (1–32) for the default socket-based protocol vs. shmem]
[Figure: PMI memory usage — memory usage per node (MB, log scale) vs. number of processes (32K–1M) for Fence and Allgather, default vs. shmem]
[Figure: Shmem hash table layout — a Hash Table (Table) region of Head/Tail/Empty bucket slots, a Key Value Store (KVS) region of Key/Value/Next entries, and a Top pointer]
[Figure: Overview — job startup performance vs. memory required to store endpoint information. P = PGAS (state of the art), M = MPI (state of the art), O = PGAS/MPI (optimized). Techniques moving P and M toward O: (a) On-demand Connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, (e) Shmem based PMI]