the red storm system: architecture, system update and

21
The Red Storm System: Architecture, System Update and Performance Analysis Douglas Doerfler, Jim Tomkins Sandia National Laboratories Center for Computation, Computers, Information and Mathematics LACSI 2006 Workshop on Performance & Productivity of Extreme-Scale Parallel Systems October 17th, 2006

Upload: others

Post on 21-Mar-2022

0 views

Category:

Documents


0 download

TRANSCRIPT

The Red Storm System:Architecture, System Update

and Performance AnalysisDouglas Doerfler, Jim TomkinsSandia National Laboratories

Center for Computation, Computers,Information and Mathematics

LACSI 2006Workshop on Performance & Productivity

of Extreme-Scale Parallel SystemsOctober 17th, 2006

Outline

•Architecture & Design•Upgrade: the numbers•Availability & Reliability•Application Performance

Red Storm Architecture

• Balanced System Performance: CPU, Memory, Interconnectand I/O

• Scalability: System Hardware and System Software scale, asingle cabinet system to 32K processor system

• Functional Partitioning: Hardware and System Software• Reliability: Full system Reliability, Availability,

Serviceability (RAS) Designed into Architecture• Upgradeability: Designed in path for system upgrade• Red/Black Switching: Flexible support for both classified

and unclassified Computing in a single system• Custom Packaging: High density, relatively low power

system• Price/Performance: Excellent performance per dollar, use

high volume commodity parts where feasible

Red Storm System (pre-upgrade)

• True MPP, designed to be a single system• Distributed memory MIMD parallel supercomputer• Fully connected 3-D mesh interconnect.• 108 compute node cabinets and 10,368 compute node

processors (AMD Opteron @ 2.0 GHz)• ~30 TB of DDR compute node memory (4 GB, 3 GB, 2 GB)• 8 Service and I/O cabinets on each end (256 processors for

each color)• ~400 TB of disk storage (~200 TB per color)• Less than 2 MW total power and cooling• Less than 3,000 ft2 of floor space

Red Storm System Upgrade

~125 TF~41 TFTeraFLOPs

~78TB (DDR400)~31TB (DDR333)Compute Memory

Seastar V2.1(~doubles HTbandwidth)

Seastar V1.2NIC

AMD Opteron,Dual-Core,@2.4GHz

AMD Opteron @2.0GHz

Processors

320/12,960/320256/10,368/256Nodes/Partition(Red/Center/Black)

27x20x2427x16x243D Mesh (computepartition)

Post-UpgradePre-Upgrade

Red Storm Layout (post upgrade)(27 × 20 × 24 Compute Node Mesh)

Disconnect Cabinets

Switchable Nodes

I/O andService Nodes

I/O andService Nodes

NormallyClassified

NormallyUnclassified

Disk storage systemnot shown

Red Storm System Software• Operating Systems

– LINUX on service and I/O nodes– LWK (Catamount) on compute

nodes– LINUX on RAS nodes

• File Systems– Parallel File System - Lustre– Unix File System- Lustre– NFS v3

• Run-Time System– Logarithmic loader– Node allocator– Batch system – PBS Pro– Libraries – MPI, I/O, Math

• Single System View

• Programming Model– Message Passing: MPI– Support for Heterogeneous

Applications• Tools

– ANSI Standard CompilersFortran, C, C++: PGI

– Debugger: TotalView– Performance Monitor:

Cray Apprentice and PAPI• System Management and

Administration– Accounting– RAS GUI Interface for

monitoring system– Single System View

Red Storm System Management and RAS

• RAS Workstations: Cray CMS– Separate and redundant RAS workstations for Red and

Black ends of machine.– System administration and monitoring interface.– Error logging and monitoring for major system

components including processors, memory, NIC/Router,power supplies, fans, disk controllers, and disks.

• RAS Network - Dedicated Ethernet network forconnecting RAS nodes to RAS workstations.

• RAS Nodes– One for each compute board - L0– One for each cabinet - L1

Red Storm Performance: Interconnect and I/O

• Interconnect performance– MPI Latency: Requirement & measured

• Neighbor < 5 µs; measured: 6.0 generic / 3.6 accelerated• Full machine < 8 µs; add ~ 3 µs to above

– Measured MPI Bandwidth• ~ 2,200 MB/sec uni-directional, ~ 4,000 MB/sec bi-directional• Peak HT bandwidth: 3.2 GB/s each direction• Peak Link bandwidth: 3.84 GB/s each direction

– Bi-section bandwidth• ~3.69 TB/s Y by Z; ~4.98 TB/s X by Z; ~8.30 TB/s X by Y (torus)

• I/O system performance– PFS Requirement: 50 GB/s sustained for each color

• Observed 50 GB/s using IOR under “ideal conditions”– External Requirement: 25 GB/s (aggregate) sustained for each

color• Observed 600 MB/s over a single 10GE link

System Availability

OUO

System Reliability

OUO

Application PerformancePost-Upgrade Preliminary Analysis

Upgrade PerformanceBefore & After

Upgrade by the Numbers• 40 to 125 TeraFLOPS• 10,368 to 12,960 compute nodes• 512 to 640 service nodes• 2X the SeaStar Interconnect

Bandwidth• 2.0 GHz single-core to 2.4 GHz

dual-core AMD Opteronprocessors

• DDR-333 to DDR-400 memoryspeed

Status• Black Section is in progress• Center Section - Oct ‘06• Red Section - Nov ‘06

• CTH - Shape Charge– Constant work/core– Speedup of at least

1.4 out to 2048sockets

• Sage - timing_c– Constant work/core– At scale, speedup

is at least a factorof 1.6

Upgrade PerformanceSingle-Core vs Dual-Core

Application PerformancePre-Upgrade Preliminary Analysis

Updated data point

Sage Scaling (John Daly, LANL)

CTH

ASC Purple & Red Storm Performance

Sandia's CTH(Shape Charge- 90x216x90 cells/PE)

Execution Time for 100 Cycles

0

500

1000

1500

2000

2500

1 10 100 1000 10000

Number Of Processors

Wa

ll T

ime

, s

ec

s

CTH-Purple

CTH-Red Storm

SEAM Benchmarks (aqua planet)

SEAM = NCAR’s Spectral Element Atmospheric Model, POP = LANL’s Parallel Ocean Program

Red Storm – 5 TF max BG/L – 4 TF max

POP Benchmarks ( 1/10 degree Ocean)

The Impact of a Balanced Architecture

• Architectural balance with low systemnoise is the key to a scalable platform

Well Balanced Traits Translate toHigh Real World Application Performance

Conclusions

• Red Storm is an architecture• Red Storm is an instantiation of that architecture• Red Storm has demonstrated excellent scalability

on real applications• The upgrade has shown significant application

speedup (more analysis to come)