optimizing all-to-all algorithm for percs network …

1
P A R A L L E L PROG R AMMI N G L AB D E P T . O F C O M P UT E R S CI E N C E , U N I V E R S I T Y O F I L L I N O I S PPL U I U C Ehsan Totoni PERCS Network BigSim Alltoall Optimization OPTIMIZING ALL-TO-ALL ALGORITHM FOR PERCS NETWORK USING SIMULATION Node Node Node Node Node . . . Supernode Supernode Supernode Supernode Node . . . Charm++ MPI Parallel Applications BigSim Emulator BigSim Simulator Terminal Output Link Utilization Tool Projections Visualizations Traces Logs Stats Communication algorithms play a crucial role in the perfor- mance of large-scale parallel systems. They are implemented in runtime systems and used in most parallel applications as a critical component. As vendors are willing to design new custom networks with significantly different performance properties for their new supercomputers, designing new effi- cient communication algorithms is an inevitable challenge. This task is desirable to be done before the machine comes online since inefficient use of the system before the new algo- rithm's availability is a huge waste of a possibly hundreds of millions of dollars resource. In this poster, we demonstrate the usability of our simulation framework, BigSim, in meeting this challenge. Using BigSim, we observe that the commonly used Pairwise-Exchange algorithm for all-to-all communication pat- tern is suboptimal for the PERCS network. We designed a new all-to-all algorithm for it and predict a five-fold performance improvement for large message sizes using this algorithm. New innovative networks (such as PERCS) are becoming popular for larger machines! Rather than well known ones such as 3D Torus No legacy theory or experience of them! We need to prepare for them before the hundred million dollor machines come online! PERCS Network will be used in many IBM machines It is a two level (complicated) network! Fully connected “Supernodes”: 32 fully-connected nodes 24-GB/s LLocal (LL) links for nodes within a drawer (8 nodes) 5-GB/s LRemote (LR) links across drawers within a Super- node, 10-GB/s D links across Supernodes BigSim is a simulation-based framework used for simulat- ing the behavior of real applications on large parallel ma- chines Charm++ or MPI application is run on the BigSim emulator BigSim is built on Charm++, which allows it to use Charm++'s processor virtualization ability to emulate multi- ple target processors on each physical processor BigSim simulator uses a network model to get actual times These network models include objects that represent proces- sors, nodes, network interface cards, switches, and network links and can simulate contention in the network We have built a BigSim model of the PERCS network and vali- dated it against IBM's MERCURY simulator and hardware proto- types BigSim shows that links are not utilized efficiently using the common Pairwise Exchange algorithm 0 50 100 150 200 250 300 0 600 1200 1800 2400 3000 Links Stacked Utilization (%) Time (ms) Base All-to-All Algorithm’s Link Utilization Link 7 (LR) Link 12 (LR) Link 24 (LR) We propose the following carefully-determined ordering to: (a) simultaneously utilize links stemming from a QCM (b) avoid the undesired congestion and link contention. Each core sends to a different node in each phase! Formally, assume a node-level all-to-all network with n nodes, each containing c cores: 1) Consider a list of t = n*c tasks running on n nodes with c cores each. 2) Each task has to send t - 1 messages, of which sets of c destination cores lie on a given node. Any core can reach a particular set of c cores by using the direct link between the destination node and its home node. 3) In phase i (0≤i≤n-1), core j (0≤j≤c-1) on every node sends data to the set of cores residing in the ((j + i) mod n)th node. 4 POWER7 chips, each with 8 cores form a compute node 192 GB/s to a Hub Chip in a Quad Chip Module (QCM) Different components inside the Hub Chip: HFI, ISR ... 2 4 6 8 10 12 14 32 64 128 256 512 1024 2048 4096 Time (s) Message Size (KB) Performance Comparison of All2All Base A2A Improved A2A Bandwidth Optimal 0 50 100 150 200 250 300 0 1380 2760 4140 5520 6900 Links Stacked Utilization (%) Time (us) Improved All-to-All Algorithm’s Link Utilization Link 7 (LR) Link 12 (LR) Link 24 (LR) Up to 5 mes improvement for large messages! Links are now ulized simultaneously Laxmikant V. Kale {totoni2, kale}@illinois.edu hp://charm.cs.illinois.edu Reference: ”Simulaon-based Performance Analysis and Tuning for a Two-level Directly Con- nected System”, E. Totoni, A. Bhatele, E. J. Bohm, N. Jain, C. Mendes, R. Mokos, G. Zheng, L. V. Kale, to appear in ICPADS 2011

Upload: others

Post on 19-Jun-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: OPTIMIZING ALL-TO-ALL ALGORITHM FOR PERCS NETWORK …

PA R A L L E LPROGRAMMING LABDEPT. OF COMPUTER SCIENCE, UNIVERSITY OF ILLINOIS

PPLU I U C

Ehsan Totoni PERCS Network BigSim Alltoall Optimization

OPTIMIZING ALL-TO-ALL ALGORITHM FOR PERCS NETWORK USING SIMULATION

Node

Node

Node

Node

Node. . .

Supernode

Supernode

Supernode

Supernode

Node. . .

Charm++ MPI

Parallel Applications

BigSim Emulator BigSim Simulator Terminal Output

Link Utilization Tool

Projections Visualizations

Traces

Logs

Stats

Communication algorithms play a crucial role in the perfor-mance of large-scale parallel systems. They are implemented in runtime systems and used in most parallel applications as a critical component. As vendors are willing to design new custom networks with signi�cantly di�erent performance properties for their new supercomputers, designing new e�-cient communication algorithms is an inevitable challenge. This task is desirable to be done before the machine comes online since ine�cient use of the system before the new algo-rithm's availability is a huge waste of a possibly hundreds of millions of dollars resource. In this poster, we demonstrate the usability of our simulation framework, BigSim, in meeting this challenge. Using BigSim, we observe that the commonly used Pairwise-Exchange algorithm for all-to-all communication pat-tern is suboptimal for the PERCS network. We designed a new all-to-all algorithm for it and predict a �ve-fold performance improvement for large message sizes using this algorithm.

New innovative networks (such as PERCS) are becoming popular for larger machines! ● Rather than well known ones such as 3D TorusNo legacy theory or experience of them!We need to prepare for them before the hundred million dollor machines come online!PERCS Network will be used in many IBM machines● It is a two level (complicated) network!● Fully connected “Supernodes”: 32 fully-connected nodes● 24-GB/s LLocal (LL) links for nodes within a drawer (8 nodes)● 5-GB/s LRemote (LR) links across drawers within a Super-node, 10-GB/s D links across Supernodes

BigSim is a simulation-based framework used for simulat-ing the behavior of real applications on large parallel ma-chines

● Charm++ or MPI application is run on the BigSim emulator● BigSim is built on Charm++, which allows it to use Charm++'s processor virtualization ability to emulate multi-ple target processors on each physical processor

BigSim simulator uses a network model to get actual times● These network models include objects that represent proces-sors, nodes, network interface cards, switches, and network links and can simulate contention in the network

● We have built a BigSim model of the PERCS network and vali-dated it against IBM's MERCURY simulator and hardware proto-types

BigSim shows that links are not utilized e�ciently using the common Pairwise Exchange algorithm

0

50

100

150

200

250

300

0 600 1200 1800 2400 3000

Lin

ks S

tack

ed U

tili

zati

on (

%)

Time (ms)

Base All-to-All Algorithm’s Link Utilization

Link 7 (LR)Link 12 (LR)Link 24 (LR)

● We propose the following carefully-determined ordering to: (a) simultaneously utilize links stemming from a QCM(b) avoid the undesired congestion and link contention.

Each core sends to a di�erent node in each phase!● Formally, assume a node-level all-to-all network with n nodes, each containing c cores:

1) Consider a list of t = n*c tasks running on n nodes with c cores each.

2) Each task has to send t − 1 messages, of which sets of c destination cores lie on a given node. Any core can reach a particular set of c cores by using the direct link between the destination node and its home node.

3) In phase i (0≤i≤n−1), core j (0≤j≤c−1) on every node sends data to the set of cores residing in the ((j + i) mod n)th node.

● 4 POWER7 chips, each with 8 cores form a compute node192 GB/s to a Hub Chip in a Quad Chip Module (QCM)● Di�erent components inside the Hub Chip: HFI, ISR ...

2

4

6

8

10

12

14

32 64 128 256 512 1024 2048 4096

Tim

e (

s)

Message Size (KB)

Performance Comparison of All2All

Base A2AImproved A2A

Bandwidth Optimal

0

50

100

150

200

250

300

0 1380 2760 4140 5520 6900

Lin

ks S

tacked U

tili

zati

on (

%)

Time (us)

Improved All-to-All Algorithm’s Link Utilization

Link 7 (LR)Link 12 (LR)Link 24 (LR)

Up to 5 times improvement for large messages!Links are now utilized simultaneously

Laxmikant V. Kale{totoni2, kale}@illinois.edu

http://charm.cs.illinois.edu

Reference: ”Simulation-based Performance Analysis and Tuning for a Two-level Directly Con-nected System”, E. Totoni, A. Bhatele, E. J. Bohm, N. Jain, C. Mendes, R. Mokos, G. Zheng, L. V. Kale, to appear in ICPADS 2011