Integrating New Capabilities into NetPIPE
Dave Turner, Adam Oline, Xuehua Chen, and Troy Benjegerdes
Scalable Computing Laboratory of Ames Laboratory
This work was funded by the MICS office of the US Department of Energy
NetPIPE: Network Protocol Independent Performance Evaluator
(Diagram of the NetPIPE modules. 2-sided protocols: MPI (MPICH, LAM/MPI, MPI/Pro, MP_Lite), PVM, and TCGMSG (runs on ARMCI or MPI), with TCP for workstations, PCs, the Cray T3E, and SGI systems. 1-sided protocols: MPI-2 1-sided MPI_Put or MPI_Get, SHMEM and GPSHMEM puts and gets, and ARMCI over TCP, GM, VIA, Quadrics, or LAPI. Native software layers: GM for Myrinet cards, InfiniBand through the Mellanox VAPI, and LAPI on the IBM SP and clusters. Internal systems: memcpy.)
+ Basic send/recv with options to guarantee pre-posting or use MPI_ANY_SOURCE.
+ Option to measure performance without cache effects.
+ One-sided communications using either Get or Put, with or without fence calls.
+ Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/
The NetPIPE utility
NetPIPE does a series of ping-pong tests between two nodes.
Message sizes are chosen at regular intervals, and with slight perturbations, to fully test the communication system for idiosyncrasies.
Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
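For readers unfamiliar with the technique, the core of such a ping-pong test can be sketched with plain MPI calls as below; the buffer handling, repetition count, and message-size schedule are simplified assumptions, not NetPIPE's actual source.

/* Simplified ping-pong sketch (assumed structure, not NetPIPE's source).
 * Build with an MPI compiler, e.g. "mpicc pingpong.c", and run on exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nrepeat = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* NetPIPE also perturbs these sizes slightly; here we simply double them. */
    for (int nbytes = 1; nbytes <= (1 << 20); nbytes *= 2) {
        char *buf = malloc(nbytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < nrepeat; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* Half the round-trip time approximates the one-way time. */
        double one_way = (MPI_Wtime() - t0) / (2.0 * nrepeat);
        if (rank == 0)
            printf("%8d bytes  %10.2f Mbps\n", nbytes, nbytes * 8.0 / one_way / 1.0e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}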
Some typical uses
Measuring the overhead of message-passing protocols.
Help in tuning the optimization parameters of message-passing libraries.
Optimizing driver and OS parameters (socket buffer sizes, etc.).
Identifying dropouts in networking hardware and drivers.
What is not measured
NetPIPE cannot measure the load on the CPU yet.
The effects from the different methods for maintaining message progress.
Scalability with system size.
Recent additions to NetPIPE
Can do an integrity test instead of measuring performance.
Streaming mode measures performance in 1 direction only.
Must reset sockets to avoid effects from a collapsing window size.
A bi-directional ping-pong mode has been added (-2).
One-sided Get and Put calls can be measured (MPI or SHMEM).
Can choose whether to use an intervening MPI_Win_fence call to synchronize (see the sketch after this list).
Messages can be bounced between the same buffers (default mode), or they can be started from a different area of memory each time.
There are lots of cache effects in SMP message-passing.
InfiniBand can show similar effects since memory must be registered with the card.
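As a hedged illustration of the one-sided option above, one exchange with MPI-2 calls might look like the following; the window setup and the exact synchronization NetPIPE uses are assumptions here, not its actual code.

/* Sketch of one ping-pong exchange using MPI-2 one-sided Puts.
 * Assumes two ranks and that 'win' was created once over 'buf' with MPI_Win_create.
 * Illustrative only; names and structure are not NetPIPE's actual code. */
#include <mpi.h>

void pingpong_put(char *buf, int nbytes, int rank, MPI_Win win)
{
    int other = 1 - rank;

    MPI_Win_fence(0, win);                /* open the access/exposure epoch */
    if (rank == 0)
        MPI_Put(buf, nbytes, MPI_CHAR, other, 0, nbytes, MPI_CHAR, win);
    MPI_Win_fence(0, win);                /* the intervening fence: completes the Put */
    if (rank == 1)
        MPI_Put(buf, nbytes, MPI_CHAR, other, 0, nbytes, MPI_CHAR, win);
    MPI_Win_fence(0, win);                /* completes the reply */
    /* NetPIPE's no-fence option omits the middle fence and relies on the library
     * making progress on its own (SIGIO-driven or polling), which is exactly the
     * behavior probed in the one-sided Put results later in this talk. */
}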
Current projects
Overlapping pair-wise ping-pong tests.
Must consider synchronization if not using bi-directional communications.
Investigate other methods for testing the global network.
Evaluate the full range from simultaneous nearest neighbor communications to all-to-all.
(Diagram: nodes n0-n3 connected through an Ethernet switch, comparing line-speed limited performance to end-point limited performance.)
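One plausible way to drive such simultaneous pair-wise tests is with MPI_Sendrecv between even/odd neighbors, so every pair loads the switch at the same time; this is only an illustrative sketch of the idea, not the eventual NetPIPE code.

/* Sketch: neighboring ranks (n0<->n1, n2<->n3, ...) exchange simultaneously
 * so the aggregate load on the switch backplane can be measured.
 * Assumes an even number of ranks; illustrative only. */
#include <mpi.h>

double neighbor_exchange(char *sbuf, char *rbuf, int nbytes, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;   /* pair even/odd ranks */

    MPI_Barrier(comm);                 /* start all pairs at roughly the same time */
    double t0 = MPI_Wtime();
    MPI_Sendrecv(sbuf, nbytes, MPI_CHAR, partner, 0,
                 rbuf, nbytes, MPI_CHAR, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Compare this per-pair time with the isolated two-node result to see
     * whether the network is line-speed or end-point limited. */
    return MPI_Wtime() - t0;
}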
Performance on Mellanox InfiniBand cards
A new NetPIPE module allows us to measure the raw performance across InfiniBand hardware (RDMA and Send/Recv).
Burst mode preposts all receives to duplicate the Mellanox test.
The no-cache performance is much lower when the memory has to be registered with the card.
An MP_Lite InfiniBand module will be incorporated into LAM/MPI.
(Plot: throughput in Mbps versus message size in Bytes, 0-7000 Mbps, for MVAPICH 0.9.1, MVAPICH without cache effects, IB VAPI burst mode, and IB VAPI Send/Recv; the MVAPICH latency is 7.5 us.)
10 Gigabit Ethernet
Intel 10 Gigabit Ethernet cards
133 MHz PCI-X bus
Single mode fiber
Intel ixgb driver
Can only achieve 2 Gbps now.
Latency is 75 us.
Streaming mode delivers up to 3 Gbps.
Much more development work is needed.
(Plot: throughput in Mbps versus message size in Bytes, 0-3500 Mbps, for 10 GigE in streaming mode and 10 GigE ping-pong with a 75 us latency.)
Channel-bonding Gigabit Ethernet for better communications between nodes
Channel-bonding uses 2 or more Gigabit Ethernet cards per PC to increase the communication rate between nodes in a cluster.
GigE cards cost ~$40 each.
24-port switches cost ~$1400, putting the total at roughly $100 per computer.
This is much more cost effective for PC clusters than using more expensive networking hardware, and may deliver similar performance.
Channel bonding in a cluster
(Diagram: each PC's CPU, cache, and memory feed two NICs on the PCI bus, and both NICs from every node connect to the network switch.)
Performance for channel-bonded Gigabit Ethernet
Channel-bonding multiple GigE cards using MP_Lite and Linux kernel bonding
GigE can deliver 900 Mbps with latencies of 25-62 us for PCs with 64-bit / 66 MHz PCI slots.
Channel-bonding 2 GigE cards / PC using MP_Lite doubles the performance for large messages.
Adding a 3rd card does not help much.
Channel-bonding 2 GigE cards / PC using Linux kernel level bonding actually results in poorer performance.
The same tricks that make channel-bonding successful in MP_Lite should make Linux kernel bonding work even better.
Any message-passing system could then make use of channel-bonding on Linux systems.
(Plot: throughput in Mbps versus message size in Bytes, 0-2500 Mbps, for MP_Lite with 2 GigE cards, MP_Lite with 3 GigE cards, Linux kernel bonding with 2 GigE cards, and a single GigE card.)
Channel-bonding in MP_Lite
(Diagram: the application on node 0 hands each message to MP_Lite in user space, which splits it into streams a and b; each stream gets its own large socket buffer and flows through its own TCP/IP stack, dev_q_xmit, device queue, and DMA transfer to a separate GigE card.)
Flow control may stop a given stream at several places.
With MP_Lite channel-bonding, each stream is independent of the others.
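A rough sketch of that idea: stripe each message over two already-connected TCP sockets, one routed over each GigE card, using non-blocking sends so that flow control on one stream never stalls the other. The function name and socket setup are hypothetical; this is not MP_Lite's actual source.

/* Sketch of striping one message across two TCP sockets (one per GigE card).
 * The sockets are assumed to be connected and routed over different interfaces.
 * Illustrative of the MP_Lite approach, not its actual code. */
#include <sys/types.h>
#include <sys/socket.h>

int send_striped(int sock_a, int sock_b, const char *buf, size_t nbytes)
{
    size_t half = nbytes / 2;
    size_t sent_a = 0, sent_b = 0;

    /* Keep pushing on whichever stream has room; a real implementation would
     * use select()/poll() or SIGIO instead of spinning. */
    while (sent_a < half || sent_b < nbytes - half) {
        if (sent_a < half) {
            ssize_t n = send(sock_a, buf + sent_a, half - sent_a, MSG_DONTWAIT);
            if (n > 0) sent_a += (size_t)n;
        }
        if (sent_b < nbytes - half) {
            ssize_t n = send(sock_b, buf + half + sent_b,
                             (nbytes - half) - sent_b, MSG_DONTWAIT);
            if (n > 0) sent_b += (size_t)n;
        }
    }
    return 0;
}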
Linux kernel channel-bonding
(Diagram: the application on node 0 writes into a single large socket buffer; the TCP/IP stack and dev_q_xmit hand packets to bonding.c, which distributes them between the two device queues, each with its own DMA transfer and GigE card.)
A full device queue will stop the flow at bonding.c, blocking both device queues.
Flow control on the destination node may stop the flow out of the socket buffer.
In both of these cases, problems with one stream can affect both streams.
Comparison of high-speed interconnects
InfiniBand can deliver 4500 - 6500 Mbps at a 7.5 us latency.
Atoll delivers 1890 Mbps with a 4.7 us latency.
SCI delivers 1840 Mbps with only a 4.2 us latency.
Myrinet performance reaches 1820 Mbps with an 8 us latency.
Channel-bonded GigE offers 1800 Mbps for very large messages.
Gigabit Ethernet delivers 900 Mbps with a 25-62 us latency.
10 GigE only delivers 2 Gbps with a 75 us latency.
(Plot: throughput in Mbps versus message size in Bytes, 0-7000 Mbps, comparing InfiniBand RDMA at a 7.5 us latency, InfiniBand without cache effects, SCI at 4.2 us, Atoll at 4.7 us, Myrinet at 8 us, channel-bonded 2xGigE at 62 us, and single GigE at 62 us.)
Conclusions
• NetPIPE provides a consistent set of analytical tools in the same flexible framework to many message-passing and native communication layers.
• New modules have been developed.
– 1-sided MPI and SHMEM
– GM, InfiniBand using the Mellanox VAPI, ARMCI, LAPI
– Internal tests like memcpy
• New modes have been incorporated into NetPIPE.
– Streaming and bi-directional modes.
– Testing without cache effects.
– The ability to test integrity instead of performance.
Current projects
• Developing new modules.
– ATOLL
– IBM Blue Gene/L
– I/O performance
• Need to be able to measure CPU load during communications.
• Expanding NetPIPE to do multiple pair-wise communications.
– Can measure the backplane performance on switches.
– Compare the line speed to end-point limited performance.
• Working toward measuring more of the global properties of a network.
– The network topology will need to be considered.
Contact information
Dave Turner - [email protected]
http://www.scl.ameslab.gov/Projects/MP_Lite/
http://www.scl.ameslab.gov/Projects/NetPIPE/
(Plot: throughput in Mbps versus message size in Bytes, 0-700 Mbps, for one-sided Puts over ARMCI, MP_Lite over raw TCP, and LAM/MPI.)
One-sided Puts between two Linux PCs
MP_Lite is SIGIO based, so MPI_Put() and MPI_Get() finish without a fence.
LAM/MPI has no message progress, so a fence is required.
ARMCI uses a polling method, and therefore does not require a fence.
MPI-2 implementations of MPICH and MPI/Pro are under development.
Hardware: Netgear GA620 fiber GigE cards in 32/64-bit, 33/66 MHz PCI slots, using the AceNIC driver.
The MP_Lite message-passing library
• A light-weight MPI implementation
• Highly efficient for the architectures supported
• Designed to be very user-friendly
• Ideal for performing message-passing research
http://www.scl.ameslab.gov/Projects/MP_Lite/
(Diagram of the MP_Lite modules. MPI applications restricted to a subset of the MPI commands, or applications using the MP_Lite syntax, run on MP_Lite, which sits on: TCP for workstations and PCs; VIA with OS-bypass on Giganet hardware or M-VIA Ethernet; an SMP shared-memory segment; SHMEM one-sided functions on the Cray T3E and SGI Origins; InfiniBand through the Mellanox VAPI; MPI to retain portability for the MP_Lite syntax; and mixed systems of distributed SMPs.)
(Plot: throughput in Mbps versus message size in Bytes, 0-3000 Mbps, for raw SHMEM, MP_Lite, the new Cray MPI, and the old Cray MPI.)
A NetPIPE example: Performance on a Cray T3E
Raw SHMEM delivers 2600 Mbps with a 2-3 us latency.
Cray MPI originally delivered 1300 Mbps with a 20 us latency.
MP_Lite delivers 2600 Mbps with a 9-10 us latency.
The new Cray MPI delivers 2400 Mbps with a 20 us latency.
The tops of the spikes are where the message size is divisible by 8 Bytes.
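For reference, the raw SHMEM path measured above comes down to calls like these; a hedged sketch using the classic Cray SHMEM interface, with the symmetric buffer and start_pes() initialization assumed.

/* Sketch of a SHMEM put ping-pong on a Cray T3E-style system.
 * 'buf' is symmetric (same address on every PE); start_pes() is assumed to have
 * been called already. Illustrative only, not NetPIPE's SHMEM module. */
#include <mpp/shmem.h>        /* classic Cray SHMEM header */

static long buf[1 << 17];     /* symmetric work array of 64-bit words */

void shmem_pingpong(int nwords)
{
    int me = shmem_my_pe();
    int other = 1 - me;

    if (me == 0) {
        shmem_put64(buf, buf, nwords, other);   /* push nwords 64-bit words to PE 1 */
        shmem_quiet();                          /* wait until the put is complete */
    }
    shmem_barrier_all();
    if (me == 1) {
        shmem_put64(buf, buf, nwords, other);   /* PE 1 replies the same way */
        shmem_quiet();
    }
    shmem_barrier_all();
}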