TRANSCRIPT
GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks
*Sudheer Chunduri, *Taylor Groves, *Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas J. Wright
SC19 - Denver, CO (*primary authors contributed equally)
The HPC and Data Center community needs a standard set of benchmarks for characterizing network performance under load.
1. Motivate/introduce GPCNeT: a network congestion benchmark
2. Describe the design of GPCNeT
3. Compare GPCNeT to congestion seen in production
4. Architectural/site evaluations:
– 4 different DoE labs
– 3 different network architectures
– Including the Slingshot network with advanced congestion control
Summary of Contributions
2
Sample of work at SC13-19 focused on network congestion:
• There Goes the Neighborhood: Performance Degradation Due to Nearby Jobs. SC13
• Network Endpoint Congestion Control for Fine-Grained Communication. SC15
• Evaluating HPC Networks via Simulation of Parallel Workloads. SC16
• Watch Out for the Bully! Job Interference Study on Dragonfly Network. SC16
• Run-to-run Variability on Xeon Phi Based Cray XC Systems. SC17
• Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing. SC18
• Understanding Congestion in High Performance Interconnection Networks Using Sampling. SC19
• Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing. SC19
• …
Despite the importance, there is no standard benchmark to measure network performance under congestion.
Network Congestion is Trending
3
“Tests like ping pong latency are like trying to understand your commute into NYC by driving the route alone at 4am.” – Steve Scott
Best Case Performance is Rare
4
Ping-pong on a quiet system vs. doing an FFT with congestion
Applications bound by the outliers (tail latency)
HPC Workloads Limited by Congestion
5
[Figure: number of measurements vs. MPI_Allreduce latency (us) at process counts 32, 64, 128, 256, and 512; the P99 latency shows a ~600X increase. From: Understanding Performance Variability on the Aries Dragonfly Network, 2017]
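The tail behavior above can be summarized with a high percentile. A minimal sketch (illustrative; not the paper's measurement code) of computing P50 and P99 from latency samples with the nearest-rank method:

```python
# Sketch: tail latency is summarized by a high percentile such as P99,
# the value below which ~99% of the samples fall (nearest-rank method).
def percentile(samples, p):
    xs = sorted(samples)
    k = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
    return xs[k]

# 98 fast samples plus two slow outliers, mimicking a congested tail
lat = [10.0] * 98 + [6000.0] * 2
print(percentile(lat, 50), percentile(lat, 99))  # -> 10.0 6000.0
```

The median barely moves, while P99 jumps by ~600X, which is why GPCNeT reports tail metrics rather than averages.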
Designing GPCNeT
1. Strike a balance: flexible, yet representative of common HPC communication patterns
2. Report performance-limiting metrics
3. Measure network performance under the effects of congestion
GPCNeT Design Criteria
7
Topology compatible: GPCNeT needs to run on any number of nodes (not just powers of 2)
Designed for Flexible Deployment
8
Probe:
• Representative communication pattern on a quiet system
• Baseline for performance, e.g. 30 minutes from work to home
Baseline with Isolated Probes
9
Probes -- Communication Pattern
10
[Diagrams: natural ring vs. random ring]
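The two probe patterns can be sketched as neighbor schedules (illustrative; the function names are ours, not GPCNeT's):

```python
import random

# Natural ring: rank i communicates with rank i+1 (wrapping), so many
# neighbor pairs share a node or switch. Random ring: the ranks are
# shuffled first, so neighbors usually sit far apart in the topology.
def natural_ring(n):
    return [(i, (i + 1) % n) for i in range(n)]

def random_ring(n, seed=0):
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [(order[i], order[(i + 1) % n]) for i in range(n)]
```

The random ring forces traffic across the network regardless of how the job was placed, which is what makes it a useful probe.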
Probes -- Measurements
11
Probes perform and report, both congested and isolated:
1. Latency
2. Bandwidth
3. MPI_Allreduce latency
By default a probe occupies 20% of the job nodes
Remaining 80% divided across four congestors
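The default split can be sketched as follows (group sizes are approximate; the helper name is ours, not the GPCNeT source):

```python
# Sketch: partition the job's nodes into one probe group (~20%) and
# four congestor groups sharing the remaining ~80%.
def partition_nodes(nodes):
    n = len(nodes)
    probe = nodes[: n // 5]        # ~20% run the probes
    rest = nodes[n // 5 :]         # ~80% run congestors
    k = len(rest) // 4
    congestors = [rest[i * k : (i + 1) * k] for i in range(4)]
    # leftover nodes (when n is not divisible by 5) join the last group
    congestors[-1].extend(rest[4 * k :])
    return probe, congestors

probe, congestors = partition_nodes(list(range(100)))
print(len(probe), [len(g) for g in congestors])  # -> 20 [20, 20, 20, 20]
```

Because the split is proportional rather than fixed, the benchmark runs on any node count, consistent with the flexibility goal above.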
Congestors
• Stress the network to evaluate performance under load, e.g. the commute from work to home grows from 30 to 50 minutes
Evaluate Probes under Stress
12
End-point congestion:
● Insensitive to routing
  ○ Point-to-point Incast
  ○ RMA Incast
  ○ RMA Broadcast

Intermediate congestion:
● Sensitive to bisection bandwidth and routing
● Pairwise all-to-all
The Two Classes of Congestors
13
Congestors (End-point)
14
RMA Broadcast (get based) and RMA Incast (put based); the point-to-point version is not shown
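End-point congestion can be pictured as an incast schedule: many sources target one root, saturating that endpoint's NIC. A toy sketch (our illustration, not the GPCNeT source):

```python
# Incast: every other rank sends (or RMA-puts) to a single root, so the
# root's injection link becomes the bottleneck regardless of routing.
def incast_targets(nranks, root=0):
    return {r: root for r in nranks and range(nranks) if r != root}

targets = incast_targets(8)  # 7 sources all aimed at rank 0
```

The get-based broadcast is the mirror image: all ranks read from one root, congesting that root's link in the other direction.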
Congestor (Intermediate)
15
Pairwise All-to-All:
● at each iteration, rank i exchanges with a new partner (rank i+1 in the first iteration)
● n-1 iterations of 4KB exchanges
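One common way to realize such a pattern is a shifted-partner schedule; this sketch assumes a simple shift (the actual GPCNeT schedule may differ):

```python
# Each of the n-1 iterations gives every rank a new partner offset, so
# after all iterations each rank has exchanged a 4 KiB message with
# every other rank, driving traffic across intermediate links.
MSG_BYTES = 4096  # 4 KiB per exchange, per the slide

def pairwise_schedule(n):
    # schedule[t][i] = the rank that i targets in iteration t+1
    return [[(i + t) % n for i in range(n)] for t in range(1, n)]

sched = pairwise_schedule(4)  # 3 iterations for 4 ranks
```

After n-1 iterations every pair of ranks has communicated once, which is what makes this congestor sensitive to bisection bandwidth and routing.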
1. Divide all nodes into 5 roughly equal-size groups
– 20% run canaries one at a time (isolated)
– 4 groups of 20% each run a specific congestor
2. Measure isolated performance of the canary
3. Start all 4 congestors
4. Measure loaded/congested performance of the canary
5. Repeat steps 2-4 for each canary
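The sequence above can be sketched as a loop (function names are illustrative; in the real benchmark the congestors run as separate process groups, not callbacks):

```python
def congestion_ratios(probes, start_congestors, stop_congestors):
    """For each canary probe, report the congested/isolated latency ratio."""
    ratios = {}
    for name, probe in probes.items():
        isolated = probe()     # step 2: measure on a quiet system
        start_congestors()     # step 3: launch all four congestors
        congested = probe()    # step 4: measure under load
        stop_congestors()
        ratios[name] = congested / isolated
    return ratios
```

Reporting the ratio of congested to isolated results is what lets runs on different systems be compared directly.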
Execution Sequence
16
GPCNeT Informs MPI Performance
17
Distribution of isolated probe performance across all process pairs
[Histogram: count vs. latency]
● 696-node Cray XC, Cray MPICH MPI
● This is the baseline we compare against
GPCNeT Informs MPI Performance
18
696-node Cray XC, Cray MPICH MPI
[Histogram: count vs. latency]
Not all Congestors are Created Equal
696-node Cray XC, Cray MPICH MPI

19

[Histogram: count vs. latency, per congestor]
Designing for Robustness vs. Best-case
128-node Cray CS500 with EDR, MVAPICH MPI
20
GPCNeT performance encompasses the whole communication/network stack (MPI, Topology, Fabric)
• Congestion and congestion control are of increasing importance in next-gen networks
• Introduced GPCNeT for evaluating congestion in HPC networks
• Observed congestion slowdowns of up to 4 orders of magnitude
• GPCNeT enables tuning communication libraries and establishing requirements for system performance
Conclusions and Future Work
21
Tuning GPCNeT
Scaling up Process Density
23
Increasing process count per node creates additional sub-communicators and avoids traffic within a node
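One way to see why higher process density still stresses the network: pair each process with the same slot on the next node, so every probe pair crosses the network (our illustration, not the GPCNeT rank mapping):

```python
# With ppn ranks per node (rank = node*ppn + slot), pairing each rank
# with the same slot on the next node keeps all probe traffic off-node
# while multiplying the number of concurrent flows per NIC by ppn.
def off_node_pairs(num_nodes, ppn):
    pairs = []
    for node in range(num_nodes):
        for slot in range(ppn):
            rank = node * ppn + slot
            peer = ((node + 1) % num_nodes) * ppn + slot
            pairs.append((rank, peer))
    return pairs
```

No pair ever lands within a node, so raising PPN raises NIC utilization rather than hiding traffic in shared memory.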
• 696 nodes
• 1, 8 & 32 PPN
• Increase PPN →
  • increase dimensions
  • increase hotspots
  • increase NIC utilization
Tuning Congestion by Process Density
24
[Chart: latency roughly 100 us at 1 PPN, 1000 us at 8 PPN, 10 ms at 32 PPN]
Increase node count →
• increase degree of incast

Default (recommended): fully populate the system
• 10% populated (64 nodes) results in limited congestion
Tuning Congestion by Node Count
25
[Histograms: count vs. latency at 64 nodes and at 696 nodes]
GPCNeT on Production Systems
• 5575-node NERSC Edison (Aries)
• 20% of nodes run as probes; the remaining 80% run in three modes:
– Quiet: idle (this is our baseline)
– Wild: production traffic (two representative runs)
– Congested: four congestors (20:20:20:20)
We show how congestion manifests:
1. Hardware counters of network routers
– per-port router stall rate sampled at 1s
2. Increased latency of GPCNeT Allreduce probes
GPCNeT vs. Congestion in the Wild
27
How does GPCNeT Compare to Production?
28
Widespread intermediate congestion (GPCNeT 3X > Wild)
P99 MPI_Allreduce probes slowed:
• 2200X vs. the quiet system
• 40X greater than Wild Q3
GPCNeT default is aggressive and stresses the system
7 Systems (4 DoE Production, 3 Cray Testbeds)
• Theta, Edison, Sierra, Summit
• System sizes from 128 to 5.5k nodes
• Aries, EDR IB and Slingshot networks
• Fully populated with GPCNeT defaults
• Report mean and P99 normalized to baseline
GPCNeT Architectural Comparisons
29
Impact of Congestion on Modern Systems

Slowdown (multiplier) compared to mean baseline (log scale), varying:
● node count
● bisection-to-injection bandwidth
● architecture

[Chart: mean and P99 slowdowns per system, spanning roughly 5X to 9000X. Node counts: 128, 485, 696, 4392, 5586, 4320, 4608; global/injection BW of 100% or 50%; architectures: Aries, EDR IB, SS]
Smaller Systems → Less Congestion

Crystal and Osprey: reduced congestion compared to larger systems of the same architecture
[Chart: node counts 128, 485, 696, 4392, 5586, 4320, 4608 across Aries, EDR IB, SS]
Latency is more Sensitive than Bandwidth
[Chart: factor of slowdown for latency vs. bandwidth probes, grouped by architecture (Aries, EDR IB, SS)]
Larger message sizes have a larger baseline time to complete the transfer
Larger messages can be distributed across multiple paths
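The asymmetry can be seen with a toy transfer-time model (the numbers below are illustrative, not measured): time = base latency + size/bandwidth, plus a fixed congestion delay.

```python
# Toy model: a fixed congestion delay inflates small (latency-bound)
# transfers far more than large (bandwidth-bound) ones.
def xfer_time(size_bytes, base_lat=1e-6, bw=1e10, congestion=1e-4):
    return base_lat + congestion + size_bytes / bw

def slowdown(size_bytes):
    return xfer_time(size_bytes) / xfer_time(size_bytes, congestion=0.0)

small = slowdown(8)        # 8 B message: roughly 100X slowdown
large = slowdown(1 << 20)  # 1 MiB message: roughly 2X slowdown
```

The same added delay that dwarfs an 8-byte transfer is a modest fraction of a megabyte transfer, matching the latency-vs-bandwidth sensitivity shown above.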
Capturing Trade-offs in Bisection BW
[Chart: global/injection BW ratios of 100% or 50% per system]
Summit has 2X the links crossing its bisection
443X latency slowdown on Sierra vs. 135X on Summit
Production settings
Next Generation Congestion Control
GPCNeT shows the value of congestion control
Slingshot designed to handle the worst case traffic patterns
[Chart: slowdowns per system; node counts 128 to 5586; Aries, EDR IB, SS; global/injection BW of 100% or 50%]
Next Generation Congestion Control
[Chart: Aries, EDR IB, SS; global/injection BW of 100% or 50%]
Routing can't always eliminate congestion

Slingshot congestion control:
1. Identifies the source(s) of congestion
2. Throttles the offending traffic
• Observed congestion slowdowns of up to 4 orders of magnitude
• Congestion control is vital in next-gen networks
• Introduced GPCNeT for evaluating congestion in HPC networks
– useful for (1) tuning communication libraries, (2) tuning routing and congestion control algorithms, and (3) system procurement
Conclusions
37
Questions?
https://github.com/netbench/GPCNET
Thanks to ORNL (Scott Atchley) and LLNL (Ramesh Pankajakshan) for running GPCNeT on their systems and providing the results for this work.
Backup
39
How do congestors impact a real workload?
• Ran LULESH on the three smaller test systems
– LULESH is not as communication-bound as the canaries
FAQS
41
Not all Congestors are Created Equal
43
696-node Cray XC, Cray MPICH MPI vs. 128-node Cray CS500 with EDR, MVAPICH MPI

RMA Bcast: significant impact on XC, no impact on EDR
P2P Incast: no impact on XC, significant impact on EDR
OMPI and MVAPICH P2P latency
• 26 node random ring
• similar trends
Differences Across MPI Implementations
44
CS500, EDR IB, 128 nodes
OMPI and MVAPICH Allreduce latency
• 26 node Allreduce
• larger differences for more complex collectives
Differences Across MPI Implementations
45
CS500, EDR IB, 128 nodes
GPCNeT vs. Endpoint Cong. in the Wild
46
• Distribution of port stall rates for the entire network
• Normalized to the mean on a quiet system
• Wild-1, Wild-2 are Q1, Q3 production, respectively
[Box plots: min, max, P50, 1st and 3rd quartiles; includes a GPCNeT canary-only series]
GPCNeT vs. Endpoint Cong. in the Wild
47
Similar peak stalls (wild vs. congested)
• Production shows more incasts of smaller degree
GPCNeT vs. Intermediate Cong. in the Wild
48
GPCNeT more aggressive than production traffic at NERSC
• Background traffic varies widely across facilities
GPCNeT vs. Throughput in the Wild
49
• Need more than 1 PPN for full throughput
• 1 PPN throughput is 1/5th that of Wild-1 and Wild-2
How is random placement fair for a system like BGQ?
Fragmentation is much more common on modern systems
What are other approaches to solving congestion?
• underprovision work
• overprovision the network
• congestion control
FAQS
50
What kind of run-to-run variability did we see?
• Random canary pairs shift at each iteration
• For high PPN, you could have a higher density of congestor roots within one part of the physical topology
• Verbose mode provides information to track rank mappings
FAQS
51
Q: Is Congestion Control active on the Infiniband tests for Summit and Sierra?
A: No
Q: Why not?
A: Congestion control does not run in production on Summit or Sierra.
FAQS
52