TRANSCRIPT
The Red Storm System: Architecture, System Update and Performance Analysis
Douglas Doerfler, Jim Tomkins
Sandia National Laboratories
Center for Computation, Computers, Information and Mathematics
LACSI 2006 Workshop on Performance & Productivity of Extreme-Scale Parallel Systems
October 17th, 2006
Outline
• Architecture & Design
• Upgrade: the numbers
• Availability & Reliability
• Application Performance
Red Storm Architecture
• Balanced System Performance: CPU, Memory, Interconnect and I/O
• Scalability: System Hardware and System Software scale, from a single-cabinet system to a 32K processor system
• Functional Partitioning: Hardware and System Software
• Reliability: Full system Reliability, Availability, Serviceability (RAS) designed into the architecture
• Upgradeability: Designed-in path for system upgrade
• Red/Black Switching: Flexible support for both classified and unclassified computing in a single system
• Custom Packaging: High density, relatively low power system
• Price/Performance: Excellent performance per dollar; use high-volume commodity parts where feasible
Red Storm System (pre-upgrade)
• True MPP, designed to be a single system
• Distributed memory MIMD parallel supercomputer
• Fully connected 3-D mesh interconnect
• 108 compute node cabinets and 10,368 compute node processors (AMD Opteron @ 2.0 GHz)
• ~30 TB of DDR compute node memory (4 GB, 3 GB, 2 GB)
• 8 Service and I/O cabinets on each end (256 processors for each color)
• ~400 TB of disk storage (~200 TB per color)
• Less than 2 MW total power and cooling
• Less than 3,000 ft² of floor space
Red Storm System Upgrade
                                      Pre-Upgrade                Post-Upgrade
  TeraFLOPS                           ~41 TF                     ~125 TF
  Compute Memory                      ~31 TB (DDR-333)           ~78 TB (DDR-400)
  NIC                                 SeaStar V1.2               SeaStar V2.1 (~doubles HT bandwidth)
  Processors                          AMD Opteron @ 2.0 GHz      AMD Opteron, dual-core, @ 2.4 GHz
  Nodes/Partition (Red/Center/Black)  256/10,368/256             320/12,960/320
  3D Mesh (compute partition)         27x16x24                   27x20x24
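The compute-partition mesh dimensions quoted above account exactly for the center-partition node counts; a quick arithmetic check:

```python
# Compute-partition node counts from the 3D mesh dimensions
# (pre-upgrade 27x16x24, post-upgrade 27x20x24, per the slides).
pre_nodes = 27 * 16 * 24    # pre-upgrade compute mesh
post_nodes = 27 * 20 * 24   # post-upgrade compute mesh
print(pre_nodes, post_nodes)   # -> 10368 12960
```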
Red Storm Layout (post upgrade)
(27 × 20 × 24 Compute Node Mesh)
[Layout diagram labels: Disconnect Cabinets; Switchable Nodes; I/O and Service Nodes on each end, one end Normally Classified and the other Normally Unclassified; disk storage system not shown]
Red Storm System Software
• Operating Systems
  – LINUX on service and I/O nodes
  – LWK (Catamount) on compute nodes
  – LINUX on RAS nodes
• File Systems
  – Parallel File System – Lustre
  – Unix File System – Lustre
  – NFS v3
• Run-Time System
  – Logarithmic loader
  – Node allocator
  – Batch system – PBS Pro
  – Libraries – MPI, I/O, Math
• Single System View
• Programming Model
  – Message Passing: MPI
  – Support for Heterogeneous Applications
• Tools
  – ANSI Standard Compilers – Fortran, C, C++: PGI
  – Debugger: TotalView
  – Performance Monitor: Cray Apprentice and PAPI
• System Management and Administration
  – Accounting
  – RAS GUI interface for monitoring the system
  – Single System View
Red Storm System Management and RAS
• RAS Workstations: Cray CMS
  – Separate and redundant RAS workstations for the Red and Black ends of the machine
  – System administration and monitoring interface
  – Error logging and monitoring for major system components including processors, memory, NIC/Router, power supplies, fans, disk controllers, and disks
• RAS Network – Dedicated Ethernet network connecting RAS nodes to RAS workstations
• RAS Nodes
  – One for each compute board – L0
  – One for each cabinet – L1
Red Storm Performance: Interconnect and I/O
• Interconnect performance
  – MPI Latency: requirement & measured
    • Neighbor: < 5 µs required; measured 6.0 µs generic / 3.6 µs accelerated
    • Full machine: < 8 µs required; add ~3 µs to the above
  – Measured MPI Bandwidth
    • ~2,200 MB/s uni-directional, ~4,000 MB/s bi-directional
    • Peak HT bandwidth: 3.2 GB/s each direction
    • Peak link bandwidth: 3.84 GB/s each direction
  – Bi-section bandwidth
    • ~3.69 TB/s Y by Z; ~4.98 TB/s X by Z; ~8.30 TB/s X by Y (torus)
• I/O system performance
  – PFS requirement: 50 GB/s sustained for each color
    • Observed 50 GB/s using IOR under "ideal conditions"
  – External requirement: 25 GB/s (aggregate) sustained for each color
    • Observed 600 MB/s over a single 10GbE link
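The bisection figures above follow from counting the links a bisecting cut crosses. A sketch of that arithmetic, under my own assumptions (not stated on the slides): each SeaStar link carries 3.84 GB/s per direction, both directions are counted, X and Y are mesh dimensions, and the cut labeled "(torus)" crosses wrap-around links too, doubling the link count.

```python
# Back-of-the-envelope check of the bisection-bandwidth figures.
LINK_GBS = 3.84          # GB/s per link, per direction (from the slides)
X, Y, Z = 27, 20, 24     # post-upgrade compute mesh

def bisection_tbs(a, b, torus=False):
    """Bidirectional bisection bandwidth (TB/s) across an a x b cut plane."""
    links = a * b * (2 if torus else 1)   # a torus cut also crosses wrap links
    return links * 2 * LINK_GBS / 1000    # x2: both directions counted

print(f"Y by Z: {bisection_tbs(Y, Z):.2f} TB/s")              # -> 3.69
print(f"X by Z: {bisection_tbs(X, Z):.2f} TB/s")              # -> 4.98
print(f"X by Y: {bisection_tbs(X, Y, torus=True):.2f} TB/s")  # -> 8.29
```

The first two match the slide's ~3.69 and ~4.98 TB/s; the torus case comes out ~8.29 TB/s against the slide's ~8.30, a rounding difference.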
Upgrade Performance: Before & After
Upgrade by the Numbers
• 40 to 125 TeraFLOPS
• 10,368 to 12,960 compute nodes
• 512 to 640 service nodes
• 2X the SeaStar interconnect bandwidth
• 2.0 GHz single-core to 2.4 GHz dual-core AMD Opteron processors
• DDR-333 to DDR-400 memory speed
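The peak-FLOPS jump above is consistent with the node counts and clock rates, assuming (my assumption, not on the slides) 2 double-precision FLOPs per core per clock for the Opteron, and a 2-channel, 8-byte-wide memory interface per socket for the DDR figures:

```python
# Sanity check on the upgrade's peak numbers.
def peak_tflops(nodes, cores_per_node, ghz, flops_per_clock=2):
    """Peak TF, assuming 2 DP FLOPs/clock/core (K8 Opteron assumption)."""
    return nodes * cores_per_node * ghz * flops_per_clock / 1000

pre = peak_tflops(10_368, 1, 2.0)    # single-core @ 2.0 GHz
post = peak_tflops(12_960, 2, 2.4)   # dual-core @ 2.4 GHz
print(f"pre:  ~{pre:.0f} TF")   # -> ~41 TF
print(f"post: ~{post:.0f} TF")  # -> ~124 TF (the slides round to ~125)

def mem_gbs(mt_per_s, channels=2, bytes_wide=8):
    """Peak memory bandwidth per socket (GB/s), assuming 2x8-byte channels."""
    return mt_per_s * channels * bytes_wide / 1000

print(f"DDR-333: {mem_gbs(333):.2f} GB/s/socket")  # -> 5.33
print(f"DDR-400: {mem_gbs(400):.2f} GB/s/socket")  # -> 6.40
```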
Status
• Black Section is in progress
• Center Section – Oct '06
• Red Section – Nov '06
• CTH – Shape Charge
  – Constant work/core
  – Speedup of at least 1.4 out to 2048 sockets
• Sage – timing_c
  – Constant work/core
  – At scale, speedup is at least a factor of 1.6
Upgrade Performance: Single-Core vs Dual-Core
CTH
ASC Purple & Red Storm Performance
Sandia's CTH (Shape Charge – 90x216x90 cells/PE)
[Chart: execution time for 100 cycles; wall time (secs) vs. number of processors, 1 to 10,000; series: CTH-Purple, CTH-Red Storm]
SEAM Benchmarks (aqua planet)
SEAM = NCAR’s Spectral Element Atmospheric Model, POP = LANL’s Parallel Ocean Program
• Red Storm – 5 TF max
• BG/L – 4 TF max
The Impact of a Balanced Architecture
• Architectural balance with low system noise is the key to a scalable platform
• Well-balanced traits translate to high real-world application performance