
Page 1: LLNL Computer Center Overview and Operational Practices

Cupps 11May2005

LLNL Computing Center Overview, Strategy and Operational Practices

Kimberly Cupps, Deputy Division Leader, High Performance Systems Computation

Lawrence Livermore National Laboratory, [email protected]

*Work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.

Page 2: LLNL Computer Center Overview and Operational Practices

Overview

• Livermore Computing (LC) Organization

• LC Hardware Components

• LC Strategy

• LC Practices

[Photo: the 32K-node BlueGene/L at LLNL, 360 teraFLOPS]

Page 3: LLNL Computer Center Overview and Operational Practices

Integrated Computing and Communications Department: Organizational Structure

Department Office: Michel McCoy, Terri Quinn, Barbara Atkinson
Department Administrator: Linda Foss

PROGRAMS
HPC Program Leaders:
• Mary Zosel, PSE
• Mark Seager, Platforms
• Steve Louis, DVS
• Dave Wiltzius, DisCom
• Doug East, CS
• Brian Carnes, SS, M&IC
• Terri Quinn, Institutional Services

DISCIPLINES
• High Performance Systems: Division Leader Doug East; Deputy Kim Cupps
• Networks and Services: Division Leader Dave Wiltzius; Deputy currently vacant
• Services and Development: Division Leader Brian Carnes; Deputy Jean Shuler
• Computer Systems and Support: Division Leader Greg Herweg; Deputies Sue Marlais and Robyne Teslich

ADMINISTRATION
• Business Manager: Susan Springer; Assistant: Patty Rogers

Page 4: LLNL Computer Center Overview and Operational Practices

Organization Components

• High Performance Systems Division
  – Kernel developers
  – System administrators
  – Security software developers
  – Archive software developers
  – Operators
  – Hardware repair team
• Services and Development Division
  – Hotline staff
  – Account specialists
  – Tools developers
  – Visualization and graphics developers
  – Visualization hardware specialists
• Networks and Services Division
  – Network programmers
  – Network hardware team

Page 5: LLNL Computer Center Overview and Operational Practices

Facilities: Terascale Simulation Facility (TSF)

The Terascale Simulation Facility (TSF), at 253,000 sq. ft., houses the 100-teraOPS-class computers.

The 253,000 sq. ft. building comprises:
– Two 23,750 sq. ft. (125' x 190') unobstructed computer floors, reconfigurable as a single computer floor
– 12 MW of computer power, expandable to 15 MW; total building power expands to 25 MW
– An office tower housing ~280 computer center support staff
– Data visualization, conferencing, and computer and network operations facilities

Page 6: LLNL Computer Center Overview and Operational Practices

The clear-span B453 floor will support 500 TF late this year.

[Floor-plan photo: BlueGene/L, ASC Purple, tertiary storage silos, and NFS disk; multiple systems plus a future system share the floor.]

This compute capability, combined with 12 MW and 47,500 ft² immediately available for computers, is unique in the complex (easily expandable to 25 MW).
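The floor-space and power figures on these slides imply a notable power density; a quick back-of-the-envelope check in Python, using only the numbers quoted above:

```python
# Facility figures from the slides: two 23,750 sq ft computer floors,
# 12 MW of computer power initially, 25 MW at full building build-out.
floor_sqft = 2 * 23_750            # 47,500 sq ft immediately available
initial_mw, buildout_mw = 12, 25

print(floor_sqft)                             # 47500
print(round(initial_mw * 1e6 / floor_sqft))   # ~253 W per sq ft at 12 MW
print(round(buildout_mw * 1e6 / floor_sqft))  # ~526 W per sq ft at 25 MW
```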

Page 7: LLNL Computer Center Overview and Operational Practices

B451, next door, provides significant capacity: it currently fields 47+ TF.

[Photo: ASC White (12 TF), Thunder (23 TF, HPC Linux), and the ASC Purple early-delivery system (12 TF).]

This building provides supplementary infrastructure, including 6+ MW and 20,000 ft² of high-quality space. Three other buildings site systems and support equipment as well.

Page 8: LLNL Computer Center Overview and Operational Practices

Platforms 2005

System | Program | Manufacturer & Model | Operating System | Interconnect | Nodes | CPUs | Memory (GB) | Peak GFLOPS

Unclassified Network (OCF): 420,660 peak GFLOPS
BlueGene/L | ASC | IBM | Linux | IBM | 65,536 | 131,072 | 32,768 | 367,002
Thunder | M&IC | California Digital | CHAOS 2.0 | Elan4 | 1,024 | 4,096 | 8,192 | 22,938
MCR | M&IC | Linux NetworX | CHAOS 2.0 | Elan3 | 1,152 | 2,304 | 4,608 | 11,059
ALC | ASC | IBM xSeries | CHAOS 2.0 | Elan3 | 960 | 1,920 | 3,840 | 9,216
UV (pEDTV) | ASC | IBM p655 | AIX 5.2 | Federation | 128 | 1,024 | 2,048 | 6,144
Frost | ASC | IBM SP | AIX 5.2 | Colony DS | 68 | 1,088 | 1,088 | 1,632
QVC | ASC | Rackable Systems | CHAOS 2.0 | (not listed) | 96 | 192 | 192 | 1,075
iLX | M&IC | RAND Federal | CHAOS 2.0 | N/A | 67 | 134 | 268 | 678
PVC | ASC | Acme Micro | CHAOS 2.0 | Elan3 | 64 | 128 | 128 | 614
GPS | M&IC | Compaq GS320/ES4 | Tru64 5.1b | N/A | 49 | 160 | 344 | 277
Qbert | M&IC | Digital 8400 | Tru64 5.1b | MC 1.5 | 2 | 20 | 24 | 25

Classified Network (SCF): 134,099 peak GFLOPS
Purple | ASC | IBM SP | AIX 5.3 | Federation | 1,528 | 12,224 | 48,896 | 99,503
White | ASC | IBM SP | AIX 5.2 | Colony DS | 512 | 8,192 | 8,192 | 12,288
Lilac (xEDTV) | ASC | IBM xSeries | CHAOS 2.0 | Elan3 | 768 | 1,536 | 3,072 | 9,186
UM (pEDTV) | ASC | IBM p655 | AIX 5.2 | Federation | 128 | 1,024 | 2,048 | 6,144
Adelie | ASC | Linux NetworX | CHAOS 2.0 | Elan3 | 128 | 256 | 512 | 1,434
Emperor | ASC | Linux NetworX | CHAOS 2.0 | Elan3 | 128 | 256 | 512 | 1,434
Ace | ASC | Rackable Systems | CHAOS 2.0 | N/A | 160 | 320 | 640 | 1,792
GViz | ASC | Rackable Systems | CHAOS 2.0 | Elan3 | 64 | 128 | 256 | 717
Ice | ASC | IBM SP | AIX 5.2 | Colony DS | 28 | 448 | 448 | 672
Tempest | ASC | IBM SP | AIX 5.2 | N/A | 12 | 84 | 480 | 555
SC Cluster | ASC | Compaq ES40/ES45 | Tru64 5.1b | N/A | 40 | 160 | 384 | 235
Whitecap | ASC | SGI Onyx3800 | Irix 6.5.13f | 4 IR3 Pipes | 1 | 96 | 96 | 77
Tidalwave | ASC | SGI Onyx2 | Irix 6.5.13f | 16 IR2 Pipes | 1 | 64 | 24 | 38
Edgewater | ASC | SGI Onyx2 | Irix 6.5.13f | 10 IR2 Pipes | 1 | 40 | 18 | 24

Unclassified (OCF) breakdown: Linux 412,582 (98%), UNIX 8,078 (2%); Capability 400,999 (95%), Capacity 16,992 (4%), Serial 980 (0%), Visualization 1,689 (0%)
Classified (SCF) breakdown: Linux 14,563 (11%), UNIX 119,536 (89%); Capability 111,791 (83%), Capacity 18,870 (14%), Serial 2,582 (2%), Visualization 856 (1%)
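As a sanity check, the per-system peaks in the table really do sum to the quoted OCF total, and the Linux share matches the 98% figure. A quick Python verification over numbers copied from the table:

```python
# Peak GFLOPS per system on the unclassified (OCF) network,
# copied from the Platforms 2005 table above.
ocf = {
    "BlueGene/L": 367_002, "Thunder": 22_938, "MCR": 11_059, "ALC": 9_216,
    "UV (pEDTV)": 6_144, "Frost": 1_632, "QVC": 1_075, "iLX": 678,
    "PVC": 614, "GPS": 277, "Qbert": 25,
}
total = sum(ocf.values())
print(total)                        # 420660, matching the OCF total

linux, unix = 412_582, 8_078        # OS breakdown from the same slide
print(round(100 * linux / total))   # 98 (% of OCF peak on Linux)
```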

Page 9: LLNL Computer Center Overview and Operational Practices

What Makes it Possible?

• Planning
• Balanced infrastructure
• Emphasis on services
• Partnerships
• Great people!

[Photo: "BlueGene/L", 360 teraFLOPS]

Page 10: LLNL Computer Center Overview and Operational Practices

Blueprints translate computational investment into science

• Blueprints are the primary planning instruments
• The yearly I/O Blueprint details the I/O infrastructure:
  – User and platform requirements
  – Architectures, issues, options
  – Action plans, deliverables, schedules, and budget
• The Blueprint focuses on vision and balanced architectures

[Figure: SAN Model for a Site-Wide Global File System. Compute nodes, the capability platform, capacity compute farms, a vis cluster, studio displays, and HPSS connect over InfiniBand or GbE I/O networks to file-system nodes, NFS, login nodes, and the system data and control networks.]

[Figure 11: SCF Throughput Specifications. Per-link bandwidth targets among the capability platforms, capacity platforms, vis cluster, HPSS, and Tri-Lab links, ranging from 20 MB/s on most links up to 480 MB/s and 1 GB/s on the highest-bandwidth paths.]

Page 11: LLNL Computer Center Overview and Operational Practices

Blueprint-driven deliverables make balanced HPC environments possible

• Archive:
  – FibreChannel SAN deployment (disk and tape)
  – 9940B tape and 6M1 mover technology insertion
  – 1 GB/s throughput from capability machines

[Chart: SCF Capability Platform-to-HPSS Performance, FY96 through FY03. Archive throughput grew from single-digit MB/s in the mid-1990s (1, 4, 6, 9 MB/s) through 120 and 170 MB/s to roughly 1 GB/s (854 and 1,037 MB/s; the 12/03 measurement used 6M1 movers). Step increases came from moving to HPSS, moving to SP nodes, moving to jumbo-frame GigE with parallel striping and faster disk and nodes using multiple pftp sessions, and moving to faster disk using multiple htar sessions on multiple nodes.]
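The pftp and htar gains described here share one idea: run several transfer sessions at once, each responsible for a different stripe of the data. Below is a minimal illustrative sketch of that pattern in Python, using ordinary local files and threads rather than the HPSS tools themselves; `parallel_copy` and `copy_stripe` are hypothetical names for this sketch:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def copy_stripe(src, dst, offset, length):
    # Copy one byte range (a "stripe") into the destination at the same offset.
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.write(fin.read(length))

def parallel_copy(src, dst, sessions=4):
    # Split the file into `sessions` stripes and copy them concurrently,
    # mimicking multiple pftp/htar sessions moving data side by side.
    size = os.path.getsize(src)
    stripe = -(-size // sessions)      # ceiling division
    with open(dst, "wb") as f:
        f.truncate(size)               # pre-size so workers can seek anywhere
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        for i in range(sessions):
            off = i * stripe
            if off >= size:
                break
            pool.submit(copy_stripe, src, dst, off, min(stripe, size - off))
```

Each worker opens the files independently and writes a disjoint byte range, so no locking is needed; on a real archive link the win comes from keeping several network streams full at once rather than waiting on a single session.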

Page 12: LLNL Computer Center Overview and Operational Practices

Robust help desk services are essential, even though such support is rare at most sites

• Serving customer access, documentation, and training needs

• Comprehensive web presence for users

– Online user manuals
– Status and monitoring of HPC systems
– Web tutorials

• HPC help desk available 9x5; Operations 24x7
  – Over 2,400 active customers, including 700 remote customers
  – 96% of 13,000 Remedy tickets in 2004 closed
  – OTP tokens distributed for both networks
  – Dedicated Application Time (DAT) assistance

Page 13: LLNL Computer Center Overview and Operational Practices

Behind the hotline is in-depth, discipline-specific customer support

Tools Support
• Emphasis on direct work with code teams
• Group members provide direct support on compilers, math libraries, performance tools, debuggers, and more
• Developers and researchers provide new capabilities for users

Visualization and Video Support
• Science run support for customers all across LLNL
• PowerWall movie generation
• Extensive documentation praised by users
• Examples of specialized support:
  – 3D stereo of underground test
  – Customized tools for DNT, Chemistry

[Figure 4: Startup performance of 6.1.0-3 on KULL at 128 tasks versus number of breakpoints (0 to 5 bp); series: Vanilla and A Modification (without TV), each with a linear fit.]

[Image: Baneberry Event]

Page 14: LLNL Computer Center Overview and Operational Practices

“Science team” visits are one way ICCD identifies issues with customers

LC Customer Interview Issues

       | Level 1 | Level 2 | Level 3 | Request
Closed |    2    |   13    |   33    |   42
Open   |    0    |    0    |    4    |   59
Total  |    2    |   13    |   37    |  101

What is different about this process?

1. All ICCD divisions are present
2. Primary focus on “non-immediate” and end-to-end issues common to a group
3. Discussion generally centers on issues not reported via the usual ticket mechanism
4. Issues are prioritized and then worked

Page 15: LLNL Computer Center Overview and Operational Practices

Alliances strengthen ASC, so we have responded proactively to their requests for cycles and service

LC is a major ASC Alliance resource:
• ~700,000 Alliance CPU hours per month
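For scale, ~700,000 CPU-hours a month is close to a thousand CPUs running flat out around the clock. A rough check (30-day month assumed):

```python
# Alliance workload quoted above, converted to continuously busy CPUs.
cpu_hours_per_month = 700_000
hours_per_month = 30 * 24        # assuming a 30-day month
print(round(cpu_hours_per_month / hours_per_month))  # ~972 CPUs busy 24/7
```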

[Chart: ASCI Alliance Usage at LLNL, Feb 2003 through Jan 2004. Monthly CPU-hours (0 to 800,000) on Blue, Frost, and ALC.]

[Image: Space Shuttle solid rocket booster 52 milliseconds after the igniter is triggered]

Mike Campbell, UIUC CSAR (Center for Simulation of Advanced Rockets): “Having people like you (note: he means Barbara Heron, LC User Services) at the labs makes a huge difference in the experience that people like me have in using these machines, and clearly a huge difference in the amount that we are able to accomplish.”

Page 16: LLNL Computer Center Overview and Operational Practices

Other Important Practices

• Management provides clear definition of a machine’s purpose and makes decisions accordingly

[Chart: Thunder Utilization by IC Projects, February 2005. Allocation versus utilization in CPU-hours (0 to 500,000) for projects including Dynamics of Lines, Regional Climate Change, Nano-Fluidic Devices, Protein Folding, First-Principles H2O, Supersqueezing, Water Availability, Nanostructures, and Chemical Dynamics.]

Page 17: LLNL Computer Center Overview and Operational Practices

Other Important Practices

• Meetings!
  – Daily 9 AM staff meetings
    • Usually very short: upcoming downtimes, issues discussion
  – Stakeholders provide input in many ways
  – Weekly science run meetings on systems yet to be GA
    • Application developer forum to speak with systems and support staff
    • Requests for run time are considered and schedules established
    • Ongoing issues discussion
  – Monthly user meetings
    • Center planning updates
    • Most months there is an application talk
• Multiprogrammatic and Institutional Computing (M&IC) Board
• Extremely talented, user-focused staff
• Many valuable collaborations/partnerships
  – Industry
  – Academia

LC culture: 24/7, user focus, solution oriented

Page 18: LLNL Computer Center Overview and Operational Practices

Summary

• Planning, coordination, infrastructure balance, and first-class user services combine to make LC a successful high performance computing center

• Continued success depends on attracting and retaining quality staff, continuing to meet programmatic needs, and exceeding expectations

• LLNL is poised to site 0.5 PF this year, on schedule and on budget
  – This includes siting the world’s two most powerful computers simultaneously

[Image: Solidified Ta, Fred Streitz et al. (LLNL): 16M atoms for 5 ns across 32,000 BG/L processors in a 22-hour continuous run on BG/L]
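Quick arithmetic on the quoted figures gives the per-processor share of that run:

```python
# Figures quoted for the Streitz et al. solidification run above.
atoms, procs = 16_000_000, 32_000
sim_ns, wall_hours = 5, 22

print(atoms // procs)                 # 500 atoms per BG/L processor
print(round(sim_ns / wall_hours, 2))  # ~0.23 simulated ns per wall-clock hour
```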