LQCD benchmarks on cluster architectures
M. Hasenbusch, D. Pop, P. Wegner (DESY Zeuthen), A. Gellrich, H. Wittig (DESY Hamburg)
Peter Wegner, DESY, CHEP03, 25 March 2003


Page 1

LQCD benchmarks on cluster architectures
M. Hasenbusch, D. Pop, P. Wegner (DESY Zeuthen), A. Gellrich, H. Wittig (DESY Hamburg)
CHEP03, 25 March 2003, Category 6: Lattice Gauge Computing

Motivation

PC Cluster @DESY

Benchmark architectures

DESY Cluster

E7500 systems

Infiniband blade servers

Itanium2

Benchmark programs, Results

Future

Conclusions, Acknowledgements

Page 2

PC Cluster Motivation

LQCD, Stream Benchmark, Myrinet Bandwidth

32/64-bit Dirac kernel, LQCD (Martin Lüscher, DESY/CERN, 2000): P4, 1.4 GHz, 256 MB Rambus, using SSE1(2) instructions incl. cache prefetch.
Time per lattice point:
0.926 microseconds (1503 MFLOPS, 32-bit arithmetic)
1.709 microseconds (814 MFLOPS, 64-bit arithmetic)

Stream benchmark, memory bandwidth:
P4 (1.4 GHz, PC800 Rambus): 1.4 … 2.0 GB/s
PIII (800 MHz, PC133 SDRAM): 400 MB/s
PIII (400 MHz, PC133 SDRAM): 340 MB/s

Myrinet, external bandwidth: 2.0 + 2.0 Gbit/s optical connection, bidirectional, ~240 MB/s sustained

Page 3

Benchmark Architectures - DESY Cluster Hardware

Nodes: Supermicro P4DC6 mainboard, 2 x XEON P4, 1.7 (2.0) GHz, 256 (512) KByte cache, 1 GByte (4 x 256 MByte) RDRAM, IBM 18.3 GB DDYS-T18350 U160 3.5" SCSI disk, Myrinet 2000 M3F-PCI64B-2 interface

Network: Fast Ethernet switch Gigaline 2024M, 48 x 100BaseTX ports + GIGAline 2024 1000BaseSX-SC

Myrinet fast interconnect: M3-E32 5-slot chassis, 2 x M3-SW16 line cards

Installation: Zeuthen: 16 dual-CPU nodes; Hamburg: 32 dual-CPU nodes

Page 4

Benchmark Architectures - DESY Cluster: i860 chipset problem

(Block diagram of the Intel i860 chipset: two Xeon processors on a 400 MHz system bus (3.2 GB/s); MCH with dual-channel RDRAM via MRH repeaters (up to 4 GB, 3.2 GB/s); two P64H bridges serving 64-bit/66 MHz PCI slots (800 MB/s, >1 GB/s hub links); ICH2 (266 MB/s hub link) serving 32-bit/33 MHz PCI slots (133 MB/s), 4 USB ports, dual ATA-100 IDE channels, LAN interface, 6-channel audio; AGP 4X graphics; Intel Hub Architecture.)

bus_read (send) = 227 MByte/s, bus_write (recv) = 315 MByte/s, of max. 528 MByte/s
External Myrinet bandwidth: 160 MByte/s; 90 MByte/s bidirectional

Page 5

Benchmark Architectures – Intel E7500 chipset

Page 6

Benchmark Architectures - E7500 system

Par-Tec (Wuppertal), 4 nodes:
Intel Xeon CPU, 2.60 GHz
2 GB ECC PC1600 (DDR-200) SDRAM
Super Micro P4DPE-G2, Intel E7500 chipset, PCI 64/66
2 x Intel PRO/1000 network connection
Myrinet M3F-PCI64B-2

Page 7

Benchmark Architectures

Leibniz-Rechenzentrum Munich (single-CPU tests):
Pentium IV, 3.06 GHz, with ECC Rambus
Pentium IV, 2.53 GHz, with Rambus 1066 memory
Xeon, 2.4 GHz, with PC2100 DDR SDRAM memory (probably FSB 400)

Megware:
8 nodes, dual XEON, 2.4 GHz, E7500 chipset
2 GB DDR ECC memory, Myrinet2000, Supermicro P4DMS-6GM

University of Erlangen:
Itanium2, 900 MHz, 1.5 MB cache, 10 GB RAM, zx1 chipset (HP)

Page 8

Benchmark Architectures - Infiniband

Megware:

10 Mellanox ServerBlades:
Single Xeon 2.2 GHz, 2 GB DDR RAM
ServerWorks GC-LE chipset
InfiniBand 4X HCA
RedHat 7.3, kernel 2.4.18-3
MPICH-1.2.2.2 and OSU patch for VIA/InfiniBand 0.6.5
Mellanox firmware 1.14, Mellanox SDK (VAPI) 0.0.4
Compiler GCC 2.96

Page 9

Dirac Operator Benchmark (SSE), 16x16^3 lattice, single P4/XEON CPU

(Bar chart: Dirac operator and linear algebra performance in MFLOPS, y-axis 0 … 3000, for single CPUs at 1.4 GHz, 2.4 GHz, 2.53 GHz and 3.06 GHz.)

Page 10

Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2x16^3 lattice, XEON CPUs, single-CPU performance

(Bar chart: per-CPU performance in MFLOPS, y-axis 0 … 500, for 4/8/16 single-CPU nodes and 2/4/8/16 dual-CPU nodes; systems: 1.7 GHz i860, 2.0 GHz i860, 2.4 GHz E7500. Myrinet2000 bandwidth: i860 90 MB/s, E7500 190 MB/s.)

Page 11

Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2x16^3 lattice, XEON CPUs, single-CPU performance, 2 and 4 nodes

Performance comparisons (MFLOPS):

            Single node    Dual node
SSE2        446            330
non-SSE     328 (74%)      283 (85%)

Parastation3 software non-blocking I/O support (MFLOPS, non-SSE):

blocking: 308
non-blocking I/O: 367 (119%)

Page 12

Maximal efficiency of external I/O

                                         MFLOPS       MFLOPS       Maximal            Efficiency
                                         (w/o comm.)  (with comm.) bandwidth (MB/s)
Myrinet (i860), SSE                      579          307          90 + 90            0.53
Myrinet/GM (E7500), SSE                  631          432          190 + 190          0.68
Myrinet/Parastation (E7500), SSE         675          446          181 + 181          0.66
Myrinet/Parastation (E7500),
  non-blocking, non-SSE                  406          368          hidden             0.91
Gigabit Ethernet, non-SSE                390          228          100 + 100          0.58
Infiniband, non-SSE                      370          297          210 + 210          0.80

Page 13

Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2x16^3 lattice, XEON/Itanium2 CPUs, single-CPU performance, 4 nodes

4 single-CPU nodes, Gbit Ethernet, non-blocking switch, full duplex:

P4 (2.4 GHz, 0.5 MB cache)
SSE: 285 MFLOPS, 88.92 + 88.92 MB/s
non-SSE: 228 MFLOPS, 75.87 + 75.87 MB/s

Itanium2 (900 MHz, 1.5 MB cache)
non-SSE: 197 MFLOPS, 63.13 + 63.13 MB/s

Page 14

Infiniband interconnect

Link: high-speed serial, 1x, 4x, and 12x

Host Channel Adapter (HCA): protocol engine, moves data via messages queued in memory

Switch: simple, low-cost, multistage network

Target Channel Adapter (TCA): interface to I/O controllers (SCSI, FC-AL, GbE, ...)

(Block diagram: CPUs and system memory attach via the memory controller and host bus to an HCA; links connect HCAs through a switch to TCAs fronting the I/O controllers.)

Up to 10 Gbit/s (4X), bidirectional
Chips: IBM, Mellanox
PCI-X cards: Fujitsu, Mellanox, JNI, IBM
http://www.infinibandta.org

Page 15

Infiniband interconnect


Page 16

Parallel (2-dim) Dirac Operator Benchmark (Ginsparg-Wilson fermions), XEON CPUs, single-CPU performance, 4 nodes

Infiniband vs Myrinet performance, non-SSE (MFLOPS):

                             XEON 1.7 GHz,           XEON 2.2 GHz,
                             Myrinet, i860 chipset   Infiniband, E7500 chipset
                             32-bit     64-bit       32-bit     64-bit
8x8^3 lattice, 2x2 grid      370        281          697        477
16x16^3 lattice, 2x4 grid    338        299          609        480

Page 17

Future - Low Power Cluster Architectures ?

Page 18

Future Cluster Architectures - Blade Servers ?

NEXCOM low-voltage blade server:
200 low-voltage Intel XEON CPUs (1.6 GHz, 30 W) in a 42U rack
Integrated Gbit Ethernet network

Mellanox Infiniband blade server:
Single XEON blades connected via a 10 Gbit (4X) Infiniband network
MEGWARE, NCSA, Ohio State University

Page 19

Conclusions

PC CPUs achieve extremely high sustained LQCD performance using SSE/SSE2 (SIMD + prefetch), assuming a sufficiently large local lattice.

Bottlenecks are the memory throughput and the external I/O bandwidth; both are improving
(chipsets: i860 → E7500 → E7505 → …,
FSB: 400 MHz → 533 MHz → 667 MHz → …,
external I/O: Gbit Ethernet → Myrinet2000 → QSnet → Infiniband → …)

Non-blocking MPI communication can improve performance, given an MPI implementation that supports it (e.g. ParaStation).

32-bit architectures (e.g. IA32) have a much better price/performance ratio than 64-bit architectures (Itanium, Opteron?).

Large low-voltage dense blade clusters could play an important role in LQCD computing (low-voltage XEON, CENTRINO?, …).

Page 20

Acknowledgements

We would like to thank Martin Lüscher (CERN) for the benchmark codes and the fruitful discussions about PCs for LQCD, and Isabel Campos Plasencia (Leibniz-Rechenzentrum Munich), Gerhard Wellein (University of Erlangen), Holger Müller (Megware), Norbert Eicker (Par-Tec), and Chris Eddington (Mellanox) for the opportunity to run the benchmarks on their clusters.