TRANSCRIPT
-
1
Checklist for Selecting and Deploying Scalable Clusters with InfiniBand Fabrics
Lloyd Dickman, CTO
InfiniBand Products, Host Solutions Group, QLogic Corporation
November 13, 2007 @ SC07, Exhibitor Forum
-
2
InfiniBand is the Fabric of Choice for HPC
Large and growing standards-based ecosystem
Aggressive technology roadmap
Low-latency, high-message-rate implementations in the market
Evidenced by increasing traction in the Top500
-
3
InfiniBand Interoperability
-
4
Interdependent Tradeoffs
Decisions:
• Host Architecture
• Fabric Topology
• Fabric Convergence / Unification
• RAS, Power / Cooling
• Software – host applications, ULPs, and fabric

Budget (CapEx + OpEx) vs. Performance
-
5
For HPC, Application Messaging Profiles Matter!
[Figure: application messaging profiles plotted by message rate vs. message size. Legacy applications (reservoir simulation, climate, weather, CFD, proteomics, genomics) and new applications (quantum chromodynamics, magneto-hydrodynamics, molecular dynamics) range from coarse-grain, embarrassingly parallel codes, where BANDWIDTH DOMINATES and Moore's Law delivered rising clock rates, to fine-grain, multi-physics codes, where LATENCY and MESSAGE RATE DOMINATE as Moore's Law shifts to multi-core and application accelerators. The goal throughout is time to solution.]
-
6
Application Messaging Patterns Change with Scale
Source: PathScale measurements, averages per node, DL_POLY_2 AWE Medium case, 19 Dec 2005
[Chart: number of messages per node (up to ~16,000,000) vs. message size (8 bytes to 524,288 bytes), with one series per processor count (32, 64, 128, 256, 512). As the processor count grows, the profile shifts toward many more messages at smaller sizes.]
-
7
QLogic InfiniBand Provides Superior Application Scaling
Scaling results are for 8 nodes / 32 cores.
Latency results include a single switch crossing.
Results use native MPIs: InfiniPath MPI for QLogic
Point-to-point microbenchmarks are no longer sufficient predictors of performance at scale.
Benchmark                                   | Comparison                          | QLE7280 DDR | QLE7240 DDR | QLE7140 SDR | vs. Competitive
Point-to-Point Message Rate (non-coalesced) | OSU Message Rate @ 4 processes/node | 13 M/s      | 12 M/s      | 9 M/s       | 2.6X
Scalable Message Rate (non-coalesced)       | HPCC MPI Random Access @ 32 cores   | .13 GUPs    | .12 GUPs    | .12 GUPs    | 7.6X
Point-to-Point Latency                      | OSU Latency (0 Byte)                | 1.3 µs      | 1.4 µs      | 1.5 µs      | Tie
Scalable Latency                            | HPCC Random Ring Latency @ 32 cores | 1.4 µs      | 1.5 µs      | 1.6 µs      | 1.8X
Point-to-Point Bandwidth                    | OSU Bandwidth (peak)                | 1.9 GB/s    | 1.5 GB/s    | 0.9 GB/s    | +24%
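To make the "microbenchmarks are not enough" point concrete, here is a minimal sketch of a windowed (streaming) message-rate probe in the spirit of the non-coalesced message-rate tests above. It is not the OSU or HPCC code; it assumes mpi4py and NumPy are installed, and it only illustrates how message rate, unlike a single ping-pong, stresses how many sends a host can keep in flight.

```python
# Minimal message-rate sketch (not the OSU suite). Run with two
# ranks, e.g.: mpirun -n 2 python msgrate.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
WINDOW, ITERS, SIZE = 64, 1000, 8     # 8-byte messages, illustrative only
buf = np.zeros(SIZE, dtype=np.uint8)  # contents unused; rate is the point

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(ITERS):
    if rank == 0:
        reqs = [comm.Isend(buf, dest=1, tag=i) for i in range(WINDOW)]
        MPI.Request.Waitall(reqs)
        comm.Recv(buf, source=1)      # ack bounds the in-flight window
    elif rank == 1:
        reqs = [comm.Irecv(buf, source=0, tag=i) for i in range(WINDOW)]
        MPI.Request.Waitall(reqs)
        comm.Send(buf, dest=0)
t1 = MPI.Wtime()

if rank == 0:
    rate = WINDOW * ITERS / (t1 - t0)
    print(f"~{rate / 1e6:.2f} M messages/s at {SIZE} bytes")
```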
-
8
SPEC MPI2007 Results Confirm Scalability
QLogic SDR to 512 Cores
[Chart: relative scaling (0–25) of the SPEC MPI2007 benchmarks 104_milc, 107_leslie3d, 113_GemsFDTD, 115_fds4, 121_pop2, 122_tachyon, 126_lammps, 127_wrf2, 128_GAPgeofem, 129_tera_tf, 130_socorro, 132_zeusmp2, and 137_lu at 32, 64, 128, 256, and 512 cores.]
-
9
Bandwidth Must Account for ALL Traffic
50% Link Utilization
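The point here is that on a unified fabric the same links carry MPI, storage, and IP traffic, so headroom must be computed against the sum of all classes. A back-of-envelope sketch with invented traffic figures; only the ~2 GB/s usable 4X DDR data rate per direction is taken from context:

```python
# Back-of-envelope sketch: effective headroom on a shared link once
# ALL traffic classes are counted. Per-class loads are hypothetical.
LINK_CAPACITY_GBPS = 2.0   # 4X DDR usable data rate, per direction

traffic_gbps = {           # hypothetical per-link offered load
    "MPI":     0.7,
    "storage": 0.2,
    "IP/mgmt": 0.1,
}

total = sum(traffic_gbps.values())
utilization = total / LINK_CAPACITY_GBPS
print(f"Total offered load: {total:.1f} GB/s "
      f"-> {utilization:.0%} link utilization")
# At 50% utilization a burst in any one class can still be absorbed;
# sizing against MPI traffic alone would overstate the headroom.
```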
-
10
Host-to-Fabric Rails
[Diagram: eight hosts, four cores each (Core1–Core4), connected over host-to-fabric rails to the InfiniBand Data Center Fabric.]
-
11
Fabric Topology Considerations
• Multi-rail
• Fat Tree / FBB (full bisection bandwidth)
• Oversubscription
• Asymmetric topology
[Diagram: example InfiniBand data center fabric topologies.]
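Oversubscription, one of the considerations above, is simply the ratio of host-facing to uplink capacity at the edge. A small sketch using the 24-port building block discussed later in the deck; the configurations are illustrative, not specific QLogic products:

```python
# Sketch: oversubscription ratio for an edge switch in a fat tree.
def oversubscription(host_ports: int, uplink_ports: int) -> float:
    """Ratio of downlink (host-facing) to uplink capacity."""
    return host_ports / uplink_ports

# 24-port edge switch at full bisection: 12 hosts, 12 uplinks -> 1:1
print(oversubscription(12, 12))   # 1.0 (FBB)
# Same switch trading uplinks for hosts: 16 hosts, 8 uplinks -> 2:1,
# i.e. roughly half the per-host fabric bandwidth under full load.
print(oversubscription(16, 8))    # 2.0
```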
-
12
QLogic InfiniBand Fabric Products
[Diagram: QLogic InfiniBand fabric product line in standard 19" racks — core fabric switches of 288, 144, 96, and 48 ports (2U–14U chassis), 24-port edge fabric switches (1U), and multi-protocol Fabric Director modules; 4X SDR and 4X DDR links with SDR and DDR spines.]
Production-quality DDR shipping since April 2006; over 50,000 ports to date, with extensive OEM qualifications.
-
13
Matching Fabric to Node Count
Optimal fabric sizes for today's 24-port switch chips:
• 1 tier – 24 ports
• 2 tiers – 288 ports
• 3 tiers – 3,456 ports
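These tier counts follow from the radix of a full-bisection fat tree: each added tier multiplies the end-port count by half the chip's port count. A minimal sketch reproducing the slide's figures:

```python
# Maximum end ports of a full-bisection fat tree built from k-port
# switch chips with t tiers: ports(t) = k * (k/2)**(t-1).
def max_ports(k: int, tiers: int) -> int:
    return k * (k // 2) ** (tiers - 1)

for t in (1, 2, 3):
    print(f"{t} tier(s): {max_ports(24, t):,} ports")
# 1 tier(s): 24 ports
# 2 tier(s): 288 ports
# 3 tier(s): 3,456 ports
```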
[Diagram: fabric sizing examples — 24-port edge switches, a 9020 switch with FC / 10 GbE gateway, and core fabric switches of 24, 48, 96, 144, and 288 ports.]
-
14
Advanced Design Capabilities
Signal Integrity
• Internal and external bit lanes capable of 10 Gb/s
• Significant signal integrity challenge
• Leveraging established QLogic center for signal integrity analysis at headquarters

Thermal Analysis
• Chip power nearly triple previous generation
• Leveraging QLogic CFD capabilities
-
15
Unified InfiniBand Fabric Reduces Cost
[Diagram: typical cluster interconnect vs. unified fabric interconnect.]
Unified fabric deployments are increasing.
-
16
Cost and Power Advantages of Converged Fabric
Cost Savings of a Converged Fabric
Component | Converged Fabric | 1GbE+FC  | 10GbE+FC
Servers   | $115,200         | $115,200 | $115,200
IB        | $113,300         | $0       | $0
Ethernet  | $0               | $61,280  | $0
10GbE     | $200             | $200     | $283,400
Storage   | $35,200          | $144,000 | $144,000
CF cost and power savings:
• FC HBAs and switches
• Ethernet NICs and switches
• Cables

Additional cost savings:
• Simpler management: single network

Comparisons for a 64-node cluster:
• CF advantage increases for larger clusters
• 30–38% cost savings on a 400/800-node FSC cluster

Power savings from CF: ~30% vs. GigE
Source: HP BoF, Converged Fabrics, SC07
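Summing the columns of the cost table above gives each design's total; the component figures are the slide's, the arithmetic below is ours. At 64 nodes the converged fabric comes out roughly 18% below 1GbE+FC and about half the cost of 10GbE+FC, consistent with the note that the advantage grows with cluster size.

```python
# Quick check of the 64-node cost comparison: summing each column
# gives the total server + interconnect + storage-network cost.
costs = {  # component: (Converged Fabric, 1GbE+FC, 10GbE+FC), in USD
    "Servers":  (115_200, 115_200, 115_200),
    "IB":       (113_300,       0,       0),
    "Ethernet": (      0,  61_280,       0),
    "10GbE":    (    200,     200, 283_400),
    "Storage":  ( 35_200, 144_000, 144_000),
}

labels = ("Converged Fabric", "1GbE+FC", "10GbE+FC")
totals = [sum(col) for col in zip(*costs.values())]
for label, total in zip(labels, totals):
    print(f"{label:>16}: ${total:,}")

cf = totals[0]  # converged-fabric total
for label, total in zip(labels[1:], totals[1:]):
    print(f"Savings vs. {label}: {(total - cf) / total:.0%}")
```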
-
17
Connectivity to ALL Storage
[Diagram: direct-connect InfiniBand storage via SRP or iSER; Fibre Channel storage through a Fibre Channel gateway; Ethernet/iSCSI storage through an Ethernet gateway via iSER or iSCSI.]
-
18
Connectivity to ALL Sockets Applications and Ethernet Networks
[Diagram: intra-cluster sockets communications over SDP, IPoIB-UD / IPoIB-CM, and TCP/IP; Ethernet emulation via VNIC through a 10 GbE gateway.]
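The two IPoIB modes in the diagram (UD and CM) are visible on a running Linux host: the standard ipoib driver exposes each interface's mode in sysfs. A small sketch; interface names such as ib0 vary by system:

```python
# Report the IPoIB transport mode (datagram = IPoIB-UD,
# connected = IPoIB-CM) for each IB interface, using the mode
# attribute the Linux ipoib driver publishes in sysfs.
from pathlib import Path

for mode_file in sorted(Path("/sys/class/net").glob("ib*/mode")):
    iface = mode_file.parent.name
    mode = mode_file.read_text().strip()   # "datagram" or "connected"
    print(f"{iface}: IPoIB-{'UD' if mode == 'datagram' else 'CM'} ({mode})")
```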
-
19
OFED+ Provides OpenFabrics Benefits Plus Optional Value-Added Capabilities

[Diagram: OFED+ software stack on QLogic 7200 Series adapters.]
• Accelerated HPC stack — MPIs on the PSM API: QLogic MPI, MPICH2, Open MPI, HP-MPI, Scali MPI, MVAPICH
• OpenFabrics Verbs path — MPIs: HP-MPI, Intel MPI, MVAPICH, Open MPI, Scali MPI; uDAPL; Lustre; ULPs: IPoIB, SDP, VNIC, SRP, iSER, RDS; OpenSM
• IP transport — standard Linux TCP/IP stack; accelerated sockets
• QLogic value-added components: QLogic SRP, QLogic SDP, QLogic VNIC, InfiniPath Fabric Suite
• OFED driver; robust ULPs and tools
-
20
Large Fabrics Challenges
Scale of InfiniBand deployments growing rapidly
• 1000s of nodes, soon 10,000s of nodes

Fabric is the cluster
• One-wire vision being deployed now
• Requires a high degree of fabric stability

Fabric diagnosis and resiliency
• Many moving parts at large scale

Demands of end nodes on the SA (subnet administrator) increasing
-
21
InfiniBand Fabric Suite
Automates cluster install, verification, and administration
• For both hosts and switches

Rich set of tools and capabilities for ongoing fabric administration
• Easy-to-use TUI and CLI
• Easy integration into site-specific tools

Rapidly pinpoints fabric errors
• Extensive verification and diagnostic capabilities

Industry-leading performance and scalability
• 1024-node fabric initialization in
-
22
Please Visit Us to Learn More About Complete InfiniBand – Booth #261
• Fluent application running multiple data sets on a converged fabric with Fibre Channel storage via a gateway
• Cluster administration using InfiniBand Fabric Suite
• InfiniBand DDR adapter for highly scaling applications
• 9020 InfiniBand switch and multi-protocol gateway
• HPCtrack partnership program
• Advanced technology demonstrations
-
23
Thank You