TRANSCRIPT
LQCD benchmarks on cluster architectures
M. Hasenbusch, D. Pop, P. Wegner (DESY Zeuthen), A. Gellrich, H. Wittig (DESY Hamburg)
CHEP03, 25 March 2003, Category 6: Lattice Gauge Computing
Motivation
PC Cluster @DESY
Benchmark architectures
DESY Cluster
E7500 systems
Infiniband blade servers
Itanium2
Benchmark programs, Results
Future
Conclusions, Acknowledgements
PC Cluster Motivation
LQCD, Stream Benchmark, Myrinet Bandwidth
32/64-bit Dirac kernel, LQCD (Martin Lüscher, (DESY) CERN, 2000):
P4, 1.4 GHz, 256 MB Rambus, using SSE1(2) instructions incl. cache prefetch
Time per lattice point:
0.926 micro sec (1503 Mflops [32-bit arithmetic])
1.709 micro sec (814 Mflops [64-bit arithmetic])

Stream benchmark, memory bandwidth:
P4 (1.4 GHz, PC800 Rambus): 1.4 … 2.0 GB/s
PIII (800 MHz, PC133 SDRAM): 400 MB/s
PIII (400 MHz, PC133 SDRAM): 340 MB/s

Myrinet, external bandwidth:
2.0+2.0 Gb/s optical connection, bidirectional, ~240 MB/s sustained
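The MFLOPS figures follow directly from the measured time per lattice point: the Wilson-Dirac operator costs roughly 1392 floating-point operations per site, a count consistent with the numbers above. A minimal sketch of the conversion (illustrative only, not part of the benchmark code):

/* Convert a measured time per lattice point into MFLOPS, assuming
 * ~1392 floating-point operations per site for the Wilson-Dirac
 * operator (the count consistent with the figures quoted above). */
#include <stdio.h>

int main(void)
{
    const double flops_per_site = 1392.0;           /* approximate flop count  */
    const double t_site_usec[]  = { 0.926, 1.709 }; /* 32-bit, 64-bit timings  */

    for (int i = 0; i < 2; i++) {
        /* flops per microsecond = MFLOPS */
        double mflops = flops_per_site / t_site_usec[i];
        printf("%.3f micro sec/site -> %.0f MFLOPS\n", t_site_usec[i], mflops);
    }
    return 0;
}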
Benchmark Architectures - DESY Cluster Hardware
Nodes: Supermicro P4DC6 mainboard, 2 x XEON P4, 1.7 (2.0) GHz, 256 (512) kByte cache,
1 GByte (4 x 256 MByte) RDRAM, IBM 18.3 GB DDYS-T18350 U160 3.5" SCSI disk,
Myrinet 2000 M3F-PCI64B-2 interface
Network: Fast Ethernet switch Gigaline 2024M, 48 x 100BaseTX ports + GIGAline 2024 1000BaseSX-SC
Myrinet fast interconnect: M3-E32 5-slot chassis, 2 x M3-SW16 line cards
Installation: Zeuthen: 16 dual-CPU nodes, Hamburg: 32 dual-CPU nodes
[Block diagram: the 64-bit PCI buses attach through two P64H bridges at 800 MB/s each, behind a hub link of >1 GB/s]
Benchmark Architectures - DESY Cluster, i860 chipset problem
[Block diagram, i860 chipset: two Xeon processors on a 400 MHz system bus; MCH with dual-channel RDRAM (3.2 GB/s, up to 4 GB via MRHs) and AGP 4X graphics; Intel Hub Architecture link (266 MB/s) to the ICH2 serving the 32-bit/33 MHz PCI slots (133 MB/s), ATA 100 dual IDE channels, 4 USB ports, LAN interface, 6-channel audio and 10/100 Ethernet; 64-bit/66 MHz PCI slots]
bus_read (send) = 227 MByte/s, bus_write (recv) = 315 MByte/s of max. 528 MByte/s
External Myrinet bandwidth: 160 MByte/s, 90 MByte/s bidirectional
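The sustained external bandwidths quoted on these slides can be reproduced with a simple MPI ping-pong test between two nodes; a minimal sketch (not the original benchmark code) follows:

/* Minimal MPI ping-pong bandwidth sketch (illustrative, not the original
 * benchmark code): rank 0 sends a 1 MB buffer to rank 1 and receives it
 * back; the sustained bandwidth is the message size divided by half the
 * round-trip time.  Run with exactly 2 MPI processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBYTES (1 << 20)   /* 1 MB message */
#define NITER  100

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(NBYTES);
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)   /* one NBYTES message per half round trip */
        printf("sustained bandwidth: %.1f MByte/s\n",
               2.0 * NITER * NBYTES / dt / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}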
Benchmark Architectures – Intel E7500 chipset
Benchmark Architectures - E7500 system
Par-Tec (Wuppertal), 4 nodes:
Intel Xeon CPU, 2.60 GHz
2 GB ECC PC1600 (DDR-200) SDRAM
Super Micro P4DPE-G2, Intel E7500 chipset, PCI 64/66
2 x Intel PRO/1000 network connection
Myrinet M3F-PCI64B-2
Benchmark Architectures
Leibniz-Rechenzentrum Munich (single-CPU tests):
Pentium IV 3.06 GHz with ECC Rambus
Pentium IV 2.53 GHz with Rambus 1066 memory
Xeon 2.4 GHz with PC2100 DDR SDRAM memory (probably FSB 400)

Megware:
8 nodes dual XEON, 2.4 GHz, E7500
2 GB DDR ECC memory, Myrinet2000
Supermicro P4DMS-6GM

University of Erlangen:
Itanium2, 900 MHz, 1.5 MB cache, 10 GB RAM
zx1 chipset (HP)
Benchmark Architectures - Infiniband
Megware:
10 Mellanox ServerBlades:
Single Xeon 2.2 GHz, 2 GB DDR RAM
ServerWorks GC-LE chipset
InfiniBand 4X HCA
RedHat 7.3, kernel 2.4.18-3
MPICH-1.2.2.2 and OSU patch for VIA/InfiniBand 0.6.5
Mellanox firmware 1.14, Mellanox SDK (VAPI) 0.0.4
Compiler GCC 2.96
Dirac Operator Benchmark (SSE), 16x16^3 lattice, single P4/XEON CPU
[Bar chart: Dirac operator and linear algebra performance in MFLOPS (0 to 3000) for single CPUs at 1.4 GHz, 2.4 GHz, 2.53 GHz and 3.06 GHz]
Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd
preconditioned, 2 x 16^3, XEON CPUs, single CPU performance
[Bar chart: single-CPU performance in MFLOPS (0 to 500) for 4, 8 and 16 single-CPU nodes and for 2, 4, 8 and 16 dual-CPU nodes (4 to 32 CPUs), comparing 1.7 GHz XEON (i860), 2.0 GHz XEON (i860) and 2.4 GHz XEON (E7500); Myrinet2000 bandwidth: i860 90 MB/s, E7500 190 MB/s]
Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd
preconditioned, 2 x 16^3, XEON CPUs, single CPU performance, 2 and 4 nodes
Performance comparisons (MFLOPS):

                SSE2    non-SSE
  Single node   446     328 (74%)
  Dual node     330     283 (85%)

ParaStation3 software non-blocking I/O support (MFLOPS, non-SSE):

  blocking I/O       308
  non-blocking I/O   367 (119%)
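The gain from non-blocking I/O comes from overlapping the boundary (halo) exchange with computation on the interior of the local lattice. A minimal sketch of the pattern, with a hypothetical buffer size and empty compute stubs (not the benchmark or ParaStation code):

/* Overlap the 1-dim halo exchange with interior computation.
 * Buffer size and compute stubs are illustrative placeholders. */
#include <mpi.h>
#include <stdio.h>

#define NHALO 4096                          /* halo size, illustrative only */
static double sbuf_lo[NHALO], sbuf_hi[NHALO];
static double rbuf_lo[NHALO], rbuf_hi[NHALO];

static void compute_interior(void) { /* sites needing no halo data */ }
static void compute_boundary(void) { /* sites using the received halos */ }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Request req[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int lo = (rank - 1 + size) % size, hi = (rank + 1) % size;

    /* post receives and sends for both boundaries first ...          */
    MPI_Irecv(rbuf_lo, NHALO, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(rbuf_hi, NHALO, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(sbuf_lo, NHALO, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(sbuf_hi, NHALO, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &req[3]);

    /* ... work on the interior while the network is busy ...         */
    compute_interior();

    /* ... and wait for the halos only when they are actually needed. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    compute_boundary();

    if (rank == 0) printf("halo exchange completed\n");
    MPI_Finalize();
    return 0;
}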
Maximal efficiency of external I/O
(Efficiency = MFLOPS with communication / MFLOPS without communication, e.g. 307 / 579 = 0.53)

                                                      MFLOPS       MFLOPS       Maximal
  Interconnect                                        (no comm.)   (with comm.) bandwidth        Efficiency
  Myrinet (i860), SSE                                 579          307          90 + 90 MB/s     0.53
  Myrinet/GM (E7500), SSE                             631          432          190 + 190 MB/s   0.68
  Myrinet/ParaStation (E7500), SSE                    675          446          181 + 181 MB/s   0.66
  Myrinet/ParaStation (E7500), non-blocking, non-SSE  406          368          hidden           0.91
  Gigabit Ethernet, non-SSE                           390          228          100 + 100 MB/s   0.58
  Infiniband, non-SSE                                 370          297          210 + 210 MB/s   0.80
Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd
preconditioned, 2 x 16^3, XEON/Itanium2 CPUs, single CPU performance, 4 nodes

4 single-CPU nodes, Gbit Ethernet, non-blocking switch, full duplex:
P4 (2.4 GHz, 0.5 MB cache), SSE: 285 MFLOPS, 88.92 + 88.92 MB/s
P4 (2.4 GHz, 0.5 MB cache), non-SSE: 228 MFLOPS, 75.87 + 75.87 MB/s
Itanium2 (900 MHz, 1.5 MB cache), non-SSE: 197 MFLOPS, 63.13 + 63.13 MB/s
Infiniband interconnect
Link: high-speed serial, 1x, 4x and 12x
Host Channel Adapter (HCA): protocol engine, moves data via messages queued in memory
Switch: simple, low-cost, multistage network
Target Channel Adapter (TCA): interface to I/O controllers (SCSI, FC-AL, GbE, ...)
[Diagram: two hosts, each with CPUs, memory controller and system memory, attach through HCAs and links to a switch; a TCA connects an I/O controller to the fabric]
http://www.infinibandta.org
Up to 10 GB/s bidirectional
Chips: IBM, Mellanox
PCI-X cards: Fujitsu, Mellanox, JNI, IBM
Infiniband interconnect
Parallel (2-dim) Dirac Operator Benchmark (Ginsparg-Wilson fermions), XEON CPUs, single CPU performance, 4 nodes
Infiniband vs Myrinet performance, non-SSE (MFLOPS):
                        XEON 1.7 GHz, Myrinet,    XEON 2.2 GHz, Infiniband,
                        i860 chipset              E7500 chipset
                        32-bit      64-bit        32-bit      64-bit
  8x8^3 lattice,
  2x2 processor grid    370         281           697         477
  16x16^3 lattice,
  2x4 processor grid    338         299           609         480
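For the 2-dim decomposition the processor grid and the neighbour ranks can be set up with MPI's Cartesian topology routines; a minimal sketch for a 2x4 grid (illustrative only, not the Ginsparg-Wilson benchmark code):

/* Set up a 2-dim processor grid (2x4, as in the benchmark above) and
 * find the four neighbour ranks needed for the halo exchange.
 * Run with exactly 8 MPI processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int dims[2] = { 2, 4 }, periods[2] = { 1, 1 };  /* periodic lattice */
    int rank, coords[2], xlo, xhi, ylo, yhi;
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    MPI_Cart_shift(grid, 0, 1, &xlo, &xhi);   /* neighbours in dimension 0 */
    MPI_Cart_shift(grid, 1, 1, &ylo, &yhi);   /* neighbours in dimension 1 */

    printf("rank %d at (%d,%d): x-neighbours %d/%d, y-neighbours %d/%d\n",
           rank, coords[0], coords[1], xlo, xhi, ylo, yhi);

    MPI_Finalize();
    return 0;
}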
Future - Low Power Cluster Architectures ?
Future Cluster Architectures - Blade Servers ?
NEXCOM – low-voltage blade server:
200 low-voltage Intel XEON CPUs (1.6 GHz, 30 W) in a 42U rack
Integrated Gbit Ethernet network

Mellanox – Infiniband blade server:
Single-XEON blades connected via a 10 Gbit (4X) Infiniband network
MEGWARE, NCSA, Ohio State University
Conclusions
PC CPUs reach an extremely high sustained LQCD performance using SSE/SSE2 (SIMD + pre-fetch), provided the local lattice is sufficiently large
Bottlenecks are the memory throughput and the external I/O bandwidth; both components are improving
(chipsets: i860 → E7500 → E7505 → …,
FSB: 400 MHz → 533 MHz → 667 MHz → …,
external I/O: Gbit Ethernet → Myrinet2000 → QSnet → Infiniband → …)
Non-blocking MPI communication can improve the performance, given an MPI implementation that supports it (e.g. ParaStation)
32-bit architectures (e.g. IA32) have a much better price/performance ratio than 64-bit architectures (Itanium, Opteron?)
Large, dense, low-voltage blade clusters could play an important role in LQCD computing (low-voltage XEON, CENTRINO?, …)
Acknowledgements
We would like to thank Martin Lüscher (CERN) for the benchmark codes and the fruitful discussions about PCs for LQCD, and Isabel Campos Plasencia (Leibniz-Rechenzentrum Munich), Gerhard Wellein (Uni Erlangen), Holger Müller (Megware), Norbert Eicker (Par-Tec) and Chris Eddington (Mellanox) for the opportunity to run the benchmarks on their clusters.