LLNL Computing Center Overview, Strategy and Operational Practices
Kimberly Cupps, Deputy Division Leader, High Performance Systems
Computation Directorate, Lawrence Livermore National Laboratory
[email protected]
11 May 2005
*Work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.
Overview
• Livermore Computing (LC) Organization
• LC Hardware Components
• LC Strategy
• LC Practices

[Image: 32K Blue Gene/L at LLNL; “BlueGene/L”, 360 teraFLOPS]
Integrated Computing and Communications Department: Organizational Structure

Department Office: Michel McCoy, Terri Quinn, Barbara Atkinson
Department Administrator: Linda Foss

PROGRAMS
HPC Program Leaders: Mary Zosel (PSE), Mark Seager (Platforms), Steve Louis (DVS), Dave Wiltzius (DisCom), Doug East (CS), Brian Carnes (SS, M&IC), Terri Quinn (Institutional Services)

DISCIPLINES
• High Performance Systems: Division Leader Doug East; Deputy Kim Cupps
• Networks and Services: Division Leader Dave Wiltzius; Deputy currently vacant
• Services and Development: Division Leader Brian Carnes; Deputy Jean Shuler
• Computer Systems and Support: Division Leader Greg Herweg; Deputies Sue Marlais and Robyne Teslich

ADMINISTRATION
Business Manager: Susan Springer; Assistant: Patty Rogers
Organization Components

• High Performance Systems Division
  – Kernel developers
  – System administrators
  – Security software developers
  – Archive software developers
  – Operators
  – Hardware repair team
• Services and Development Division
  – Hotline staff
  – Account specialists
  – Tools developers
  – Visualization and graphics developers
  – Visualization hardware specialists
• Networks and Services Division
  – Network programmers
  – Network hardware team
Facilities: Terascale Simulation Facility (TSF)
The Terascale Simulation Facility (TSF), at 253,000 sq. ft., houses the 100-TeraOps-class computers

253,000 sq. ft. building comprising:
– Two 23,750 sq. ft. (125’ x 190’) unobstructed computer floors, reconfigurable as a single computer floor
– 12 MW of computer power, expandable to 15 MW; total building power expands to 25 MW
– Building office tower housing ~280 computer center support staff
– Data visualization, conferencing, and computer and network operations facilities
Clear span B453 floor will support 500 TF late this year
[Floor plan: BlueGene/L, ASC Purple, tertiary storage silos, NFS disk, multiple current systems, and a future system]

This compute capability, combined with 12 MW and 47,500 sq. ft. immediately available for computers, is unique in the complex (easily expandable to 25 MW).
B451, next door, provides significant capacity; it currently fields 47+ TF

ASC White (12 TF)
Thunder (23 TF), HPC Linux
ASC Purple Early Delivery (12 TF)

This building provides supplementary infrastructure, including 6+ MW and 20,000 sq. ft. of high-quality space. Three other buildings site systems and support equipment as well.
Platforms 2005
System | Program | Manufacturer & Model | Operating System | Interconnect | Nodes | CPUs | Memory (GB) | Peak GFLOPS

Unclassified Network (OCF): total peak 420,660 GFLOPS
BlueGene/L | ASC | IBM | Linux | IBM | 65,536 | 131,072 | 32,768 | 367,002
Thunder | M&IC | California Digital | CHAOS 2.0 | Elan4 | 1,024 | 4,096 | 8,192 | 22,938
MCR | M&IC | Linux NetworX | CHAOS 2.0 | Elan3 | 1,152 | 2,304 | 4,608 | 11,059
ALC | ASC | IBM xSeries | CHAOS 2.0 | Elan3 | 960 | 1,920 | 3,840 | 9,216
UV (pEDTV) | ASC | IBM p655 | AIX 5.2 | Federation | 128 | 1,024 | 2,048 | 6,144
Frost | ASC | IBM SP | AIX 5.2 | Colony DS | 68 | 1,088 | 1,088 | 1,632
QVC | ASC | Rackable Systems | CHAOS 2.0 |  | 96 | 192 | 192 | 1,075
iLX | M&IC | RAND Federal | CHAOS 2.0 | N/A | 67 | 134 | 268 | 678
PVC | ASC | Acme Micro | CHAOS 2.0 | Elan3 | 64 | 128 | 128 | 614
GPS | M&IC | Compaq GS320/ES4 | Tru64 5.1b | N/A | 49 | 160 | 344 | 277
Qbert | M&IC | Digital 8400 | Tru64 5.1b | MC 1.5 | 2 | 20 | 24 | 25

Classified Network (SCF): total peak 134,099 GFLOPS
Purple | ASC | IBM SP | AIX 5.3 | Federation | 1,528 | 12,224 | 48,896 | 99,503
White | ASC | IBM SP | AIX 5.2 | Colony DS | 512 | 8,192 | 8,192 | 12,288
Lilac (xEDTV) | ASC | IBM xSeries | CHAOS 2.0 | Elan3 | 768 | 1,536 | 3,072 | 9,186
UM (pEDTV) | ASC | IBM p655 | AIX 5.2 | Federation | 128 | 1,024 | 2,048 | 6,144
Adelie | ASC | Linux NetworX | CHAOS 2.0 | Elan3 | 128 | 256 | 512 | 1,434
Emperor | ASC | Linux NetworX | CHAOS 2.0 | Elan3 | 128 | 256 | 512 | 1,434
Ace | ASC | Rackable Systems | CHAOS 2.0 | N/A | 160 | 320 | 640 | 1,792
GViz | ASC | Rackable Systems | CHAOS 2.0 | Elan3 | 64 | 128 | 256 | 717
Ice | ASC | IBM SP | AIX 5.2 | Colony DS | 28 | 448 | 448 | 672
Tempest | ASC | IBM SP | AIX 5.2 | N/A | 12 | 84 | 480 | 555
SC Cluster | ASC | Compaq ES40/ES45 | Tru64 5.1b | N/A | 40 | 160 | 384 | 235
Whitecap | ASC | SGI Onyx3800 | Irix 6.5.13f | 4 IR3 Pipes | 1 | 96 | 96 | 77
Tidalwave | ASC | SGI Onyx2 | Irix 6.5.13f | 16 IR2 Pipes | 1 | 64 | 24 | 38
Edgewater | ASC | SGI Onyx2 | Irix 6.5.13f | 10 IR2 Pipes | 1 | 40 | 18 | 24
Peak GFLOPS by operating system:
  Unclassified (OCF): Linux 412,582 (98%); UNIX 8,078 (2%)
  Classified (SCF): Linux 14,563 (11%); UNIX 119,536 (89%)

Peak GFLOPS by workload class:
  Unclassified (OCF): Capability 400,999 (95%); Capacity 16,992 (4%); Serial 980 (0%); Visualization 1,689 (0%)
  Classified (SCF): Capability 111,791 (83%); Capacity 18,870 (14%); Serial 2,582 (2%); Visualization 856 (1%)
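The Peak GFLOPS figures follow the usual convention of CPU count x clock rate x floating-point operations per clock. As a quick check (the per-core clock and flops-per-clock values below are standard published processor specifications, not taken from the slide):

  BlueGene/L: 131,072 PowerPC 440 cores x 0.7 GHz x 4 flops/clock = 367,002 GFLOPS (rounded)
  Thunder:      4,096 Itanium 2 cores  x 1.4 GHz x 4 flops/clock =  22,938 GFLOPS (rounded)

Both agree with the table above.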
What Makes it Possible?
• Planning
• Balanced infrastructure
• Emphasis on services
• Partnerships
• Great people!

[Image: “BlueGene/L”, 360 teraFLOPS]
Blueprints translate computational investment into science
• Blueprints are primary planning instruments
• The yearly I/O Blueprint details I/O infrastructure:
– User and platform requirements
– Architectures, issues, options
– Action plans, deliverables, schedules, and budget
• The Blueprint focuses on vision and balanced architectures.
[Diagram: SAN Model for Site-Wide Global File System. Compute nodes and FS nodes on the capability platform, the capacity compute farms, the visualization cluster, studio displays, NFS, login networks, and HPSS all attach to shared InfiniBand or GbE I/O networks, with separate system data and control networks.]
[Figure 11. SCF Throughput Specifications: required data rates, ranging from 20 MB/s to 1 GB/s, among the capability platforms, capacity platforms, visualization cluster, HPSS archive, and Tri-Lab connections.]
Blueprint-driven deliverables make balanced HPC environments possible

• Archive:
  – FibreChannel SAN deployment (disk and tape)
  – 9940B tape and 6M1 mover technology insertion
  – 1 GB/s throughput from capability machines
[Chart: SCF Capability Platform-to-HPSS Performance, FY96 through FY03. Archive throughput milestones of 1, 4, 6, 9, 120, 170, 854, and 1,037 MB/s as the archive path moved to HPSS, to SP nodes, to jumbo-frame GigE with parallel striping and faster disks and nodes using multiple pftp sessions, and to faster disks using multiple htar sessions on multiple nodes; the 1,037 MB/s point is the 12/03 measurement with 6M1 movers.]
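The later gains in the chart come from running many archive streams at once rather than from any single faster session. Purely as an illustration (this is not LC's actual tooling; the directory names, archive paths, and session count are made up), a minimal Python sketch of fanning out several concurrent htar sessions from one node could look like this:

  # Illustration only: launch several independent htar sessions so that
  # aggregate platform-to-HPSS bandwidth scales with the number of streams.
  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  def archive_dir(src_dir, hpss_archive):
      # One htar session: create (-c), verbose (-v), archive file in HPSS (-f).
      cmd = ["htar", "-cvf", hpss_archive, src_dir]
      return subprocess.run(cmd).returncode

  if __name__ == "__main__":
      # Hypothetical job list: eight source directories, eight HPSS archives.
      jobs = [("run_output/chunk%02d" % i, "project/chunk%02d.tar" % i) for i in range(8)]
      # Eight concurrent sessions; aggregate bandwidth grows with the stream count
      # until the network, disks, or archive movers become the bottleneck.
      with ThreadPoolExecutor(max_workers=8) as pool:
          codes = list(pool.map(lambda job: archive_dir(*job), jobs))
      print("htar return codes:", codes)

Spreading the same pattern across several login or FS nodes is what the chart annotation's "multiple htar sessions on multiple nodes" refers to.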
Robust help desk services are essential, despite the rarity of such support at most sites
• Serving customer access, documentation, and training needs
• Comprehensive web presence for users
– Online user manuals
– Status and monitoring of HPC systems
– Web tutorials
• HPC help desk available 9x5; Operations available 24x7
– Over 2,400 active customers; 700 remote customers
– 96% of 13,000 Remedy tickets in 2004 closed
– OTP tokens distributed for both networks
– Dedicated Application Time (DAT) assistance
Behind the hotline is in-depth, discipline-specific customer support

Tools Support
• Emphasis on direct work with code teams
• Group members provide direct support on compilers, math libraries, performance tools, debuggers…
• Developers and researchers provide new capabilities for users.
Visualization and Video Support
• Science run support for customers all across LLNL
• PowerWall movie generation
• Extensive documentation praised by users
• Examples of specialized support:
  – 3D stereo of underground test
  – Customized tools for DNT, Chemistry

[Figure 4: Startup performance of 6.1.0-3 on KULL at 128 tasks; startup time plotted against the number of breakpoints (0-5 bp) for "Vanilla" and "A Modification Without TV", with linear fits to each.]
[Image: Baneberry Event]
“Science team” visits are one way ICCD identifies with customers
LC Customer Interview Issues

          Level 1   Level 2   Level 3   Request
Closed       2        13        33        42
Open         0         0         4        59
Total        2        13        37       101
What is different about this process?
1. All ICCD divisions are present
2. Primary focus on “non-immediate” and on end-to-end issues common to a group
3. Discussion generally centers around issues not reported via usual ticket mechanism
4. Issues are prioritized and then worked
Alliances strengthen ASC, so we have responded proactively to their requests for cycles and service

LC is a major ASC Alliance resource: ~700,000 Alliance CPU hours per month
[Chart: ASCI Alliance Usage at LLNL, Feb 2003 through Jan 2004. Monthly Alliance usage in CPU hours (approaching 800,000) on Blue, Frost, and ALC.]
Space Shuttle solid rocket booster 52 milliseconds after the igniter is triggered
Mike Campbell, UIUC CSAR (Center for Simulation of Advanced Rockets): “Having people like you (note: he means Barbara Heron, LC User Services) at the labs makes a huge difference in the experience that people like me have in using these machines, and clearly a huge difference in the amount that we are able to accomplish.”
Other Important Practices
• Management provides clear definition of a machine’s purpose and makes decisions accordingly
[Chart: Thunder Utilization by IC Projects, February 2005. Allocation vs. utilization in CPU-hours (up to ~500,000) for projects including Dynamics of Lines, Regional Climate Change, Nano-Fluidic Devices, Protein Folding, First-Principles H2O, SuperSqueezing, Water Availability, Nanostructures, and Chemical Dynamics.]
Other Important Practices
• Meetings!
  – Daily 9 AM staff meetings
    • Usually very short; upcoming downtimes, issues discussion
  – Stakeholders provide input in many ways
    • Weekly science run meetings on systems yet to be GA
      – Application developer forum to speak with systems and support staff
      – Requests for run time are considered, schedules established
      – Ongoing issues discussion
    • Monthly user meetings
      – Center planning updates
      – Most months there is an application talk
    • Multiprogrammatic and Institutional Computing (M&IC) Board
• Extremely talented, user-focused staff
• Many valuable collaborations/partnerships
  – Industry
  – Academia
LC culture: 24/7, user focus, solution oriented
Summary
• Planning, coordination, infrastructure balance, and first-class user services combine to make LC a successful high performance computing center
• Continued success depends on attracting and retaining quality staff, continuing to meet programmatic needs, and exceeding expectations
• LLNL is poised to site 0.5 PF this year, on schedule, on budget
  – This includes siting the world’s two most powerful computers simultaneously

[Image: Solidified Ta, Fred Streitz et al. (LLNL): 16M atoms for 5 ns across 32,000 BG/L processors in a 22-hour continuous run on BG/L]