ncep hpc transition - ecmwf · ncep hpc transition 15 th ecmwf workshop on the use of hpc in...

30
NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology Allan Darling Deputy Director, NCEP Central Operations

Upload: others

Post on 25-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

NCEP HPC Transition 15th ECMWF Workshop on the Use of HPC

in Meteorology

Allan Darling Deputy Director, NCEP Central Operations

Page 2: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

CURRENT OPERATIONAL CHALLENGE

WCOSS – NOAA Weather and Climate Operational Supercomputing System

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology 2

Page 3: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

3

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

90.0

1955 1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015

36 Hour Forecast 72 Hour Forecast

NCEP Operational Forecast Skill 36 and 72 Hour Forecasts @ 500 MB over North America

[100 * (1-S1/70) Method]

IBM P690

IBM 701

IBM 704

IBM 7090

IBM 7094

CDC 6600

IBM 360/195

CYBER 205

CRAY Y-MP

CRAY C90

IBM SP

IBM P655+

IBM Power 6

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 4: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

NCEP HPC Growth

4

Tera

FLO

P pe

r ope

ratio

nal s

yste

m

1.2

1.8

4.4

14.0 15.5

69.7 73.9

1.0

10.0

100.0

1000.0

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 5: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Production HWM (02/16/2010)

0

10

20

30

40

50

60

70

80

90

100

110

120

130

140

00:0

0:00

00:5

6:00

01:5

2:00

02:4

8:00

03:4

4:00

04:4

0:00

05:3

6:00

06:3

2:00

07:2

8:00

08:2

4:00

09:2

0:00

10:1

6:00

11:1

2:00

12:0

8:00

13:0

4:00

14:0

0:00

14:5

6:00

15:5

2:00

16:4

8:00

17:4

4:00

18:4

0:00

19:3

6:00

20:3

2:00

21:2

8:00

22:2

4:00

23:2

0:00

HPC Production Utilization February 2010

1 Oct 2012 5 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 6: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

6 0

10

20

30

40

50

60

70

80

90

100

110

120

130

140

15000

:00

00:2

700

:54

01:2

101

:48

02:1

502

:42

03:0

903

:36

04:0

304

:30

04:5

705

:24

05:5

106

:18

06:4

507

:12

07:3

908

:06

08:3

309

:00

09:2

709

:54

10:2

110

:48

11:1

511

:42

12:0

912

:36

13:0

313

:30

13:5

714

:24

14:5

115

:18

15:4

516

:12

16:3

917

:06

17:3

318

:00

18:2

718

:54

19:2

119

:48

20:1

520

:42

21:0

921

:36

22:0

322

:30

22:5

723

:24

23:5

1

Production 08/27/2012 81.1% Utilized (69.77% Prod/11.33% Dev)

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 7: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

7 May 2012

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 8: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

8

FY11 On-Time Product Generation

99

99.1

99.2

99.3

99.4

99.5

99.6

99.7

99.8

99.9

100

2010

1001

2010

1007

2010

1013

2010

1019

2010

1025

2010

1031

2010

1106

2010

1112

2010

1118

2010

1124

2010

1130

2010

1206

2010

1212

2010

1218

2010

1224

2010

1230

2011

0105

2011

0111

2011

0117

2011

0123

2011

0129

2011

0204

2011

0210

2011

0216

2011

0222

2011

0228

2011

0306

2011

0312

2011

0318

2011

0324

2011

0330

2011

0405

2011

0411

2011

0417

2011

0423

2011

0429

2011

0505

2011

0511

2011

0517

2011

0523

2011

0529

2011

0604

2011

0610

2011

0616

2011

0622

2011

0628

2011

0704

2011

0710

2011

0716

2011

0722

2011

0728

99.73% Average On-Time Product Generation - Oct-July 2011

10/6/2010 and 10/7/2010: Job failures

10/27/2010: Storage failure

2/16/2011: Failed node

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 9: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

9

FY12 On-Time Product Generation

99

99.1

99.2

99.3

99.4

99.5

99.6

99.7

99.8

99.9

100

99

99.1

99.2

99.3

99.4

99.5

99.6

99.7

99.8

99.9

100

2011

1001

2011

1005

2011

1009

2011

1013

2011

1017

2011

1021

2011

1025

2011

1029

2011

1102

2011

1106

2011

1110

2011

1114

2011

1118

2011

1122

2011

1126

2011

1130

2011

1204

2011

1208

2011

1212

2011

1216

2011

1220

2011

1224

2011

1228

2012

0101

2012

0105

2012

0109

2012

0113

2012

0117

2012

0121

2012

0125

2012

0129

2012

0202

2012

0206

2012

0210

2012

0214

2012

0218

2012

0222

2012

0226

2012

0301

2012

0305

2012

0309

2012

0313

2012

0317

2012

0321

2012

0325

2012

0329

2012

0402

2012

0406

2012

0410

2012

0414

2012

0418

2012

0422

2012

0426

2012

0430

2012

0504

2012

0508

2012

0512

2012

0516

2012

0520

2012

0524

2012

0528

2012

0601

2012

0605

2012

0609

2012

0613

2012

0617

2012

0621

2012

0625

2012

0629

2012

0703

2012

0707

2012

0711

2012

0715

2012

0719

2012

0723

2012

0727

2012

0731

2012

0804

2012

0808

2012

0812

2012

0816

2012

0820

2012

0824

2012

0828

99.35% Average On-time Product Generation - Oct 2011 - Aug 2012

11/10/2011: Facility power outage, requiring

emergency system switch

2/1/2012: DNS issue

2/6/2012: HW issues leads to emergency system

switch.

2/14/2012 & 2/15/2012: Implementation error

2/16/2012: Hardware issue

3/20/2012: Hardware

10/18/2011 & 10/19/2011:

Implementation error.

2/7/2012, 2/8/2012 & 2/9/2012: System performance issue

12/24/2011 & 12/25/2011: Model failure

2/18/2012: System hardware failure

5/14/2012: Job failure

5/13/2012: Storage failure &

emergency system switch

6/9/2012: Model failure

8/1/2012: Power loss at facility and emergency system switch

8/16/2012 & 8/17/2012:

Hardware repair issue

8/28/2012 & 8/29/2012: System management loading

issue

Impact worsened due to system load

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 10: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

CONTRACT & SYSTEM OVERVIEW

WCOSS – NOAA Weather and Climate Operational Supercomputing System

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology 10

Page 11: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

WCOSS Contract Overview

• Sustains current operational systems during build, acceptance and transition to new systems – Bridge contract expires September 2013

• One 5-year base period, one 3-year option period, one 2-year transition option period

• Provides phased delivery of new Primary and Backup systems in the 5-year base period – Will transition from IBM/Power6/AIX to IBM/Intel/Linux by

end of FY13 • Provides new sites

– Primary Reston, VA – Backup Orlando, FL

11 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 12: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Significance • Where United States weather forecast

process starts for the protection of lives and livelihood

• Produces model guidance at global, national, and regional scales

Examples: – Hurricane Forecasts – Aviation / Transportation – Air Quality – Fire Weather

• Contract ensures no gap in operations

Bridge Contract Current NOAA Weather and Climate Operational Supercomputer System

Location • Primary

– Gaithersburg, MD (IBM provided facility) • Backup

– Fairmont, WV (GFE NASA IV&V facility) Configuration • Identical Systems (per site)

– IBM Power 6/P575/AIX – 73.9 trillion calculations/sec – 5,314 processing cores – 0.8 petabytes of storage

• Performance Requirements – Minimum 99.0% Operational Use Time – Minimum 99.0% On-time Product Generation – Minimum 99.0% Development Use Time – Minimum 99.0% System Availability – Failover tested regularly

Inputs and Outputs • Processes 3.5 billion observations/day • Produces over 15 million products/day 12

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 13: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Significance • Where United States weather forecast

process starts for the protection of lives and livelihood

• Produces model guidance at global, national, and regional scales

Examples: – Hurricane Forecasts – Aviation / Transportation – Air Quality – Fire Weather

WCOSS Contract Future NOAA Weather and Climate Operational Supercomputer System

Location • Primary

– Reston, VA (IBM provided facility) • Backup

– Orlando, FL (IBM provided facility) Configuration • Identical Systems (per site)

– IBM iDataPlex/Intel Sandy Bridge/Linux – 208 trillion calculations/sec – 10,048 processing cores – 2.59 petabytes of storage

• Performance Requirements – Minimum 99.9% Operational Use Time – Minimum 99.0% On-time Product Generation – Minimum 99.0% Development Use Time – Minimum 99.0% System Availability – Failover tested regularly

Inputs and Outputs • Processes 3.5 billion observations/day • Produces over 15 million products/day

13 1 Oct 2012 15th ECMWF Workshop on the Use of HPC

in Meteorology

Page 14: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Key Performance Metrics

• Bridge – At least 99.0% operational use time

• WCOSS – At least 99.9% operational use time

• WCOSS and Bridge – At least 99.0% system availability

– Sustain on-time generation of model guidance products at a rate of 99% or better; within 15 minutes of target completion times

– No degradation in product delivery to occur during planned, routine failover testing between the Primary and Backup WCOSS

– Delivery of committed computing capability upgrades objectively measured and accepted through the use of benchmark suite

14

Page 15: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

WCOSS Locations

15 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 16: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

System Characteristics (per site)

Life Cycle Architecture OS Average Capability

Average Capacity Cores TeraFLOP

Bridge System

Oct 2012 - Sep 2013 Power6 AIX 5.3 1.0X 1.0X 5,314 73.9 TF

WCOSS Phase 1

Planned accept Dec 2012 –

FOC Aug 2013 iDataPlex Linux

(RHEL)

*2.0X over

Bridge P6

*2.3X over

Bridge P6 10,048 208 TF

WCOSS Phase 2

Planned accept Dec 2014 –

FOC Jul 2015 iDataPlex Linux

(RHEL)

*1.9X over

Phase 1

*4.4X over

Phase 1 *44,400 *920 TF

16

*Estimated values; assumes FY13 NWS Reallocation Budget

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 17: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

System Characteristics – cont. (per site)

Life Cycle Operational Use Time Requirement

Storage Useable Power KW Floor Space

Bridge System

Oct 2012 - Sep 2013 99% 0.80 PB 689 KW 3,100 SF

WCOSS Phase 1

Plan accept Dec 2012 – FOC Aug

2013 99.9% 2.59 PB 469 KW 4,060 SF

WCOSS Phase 2

Plan accept Dec 2014 – FOC Jul

2015 99.9% *7.2PB *1050 KW 4,060 SF

17

* Estimated values; assumes FY13 NWS Reallocation Budget

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 18: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

1.2

1.8

4.4

14.0 15.5

69.7 73.9

208

920

1.0

10.0

100.0

1000.0

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

Projected

WCOSS HPC Growth

18

Built into the contract are capability and storage increases approximately every 3 years with no increase in funding NOAA modeling plans rely heavily on “built-in” contractual increases

Tera

FLO

P pe

r ope

ratio

nal s

yste

m

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 19: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

SCHEDULE & TRANSITION

WCOSS – NOAA Weather and Climate Operational Supercomputing System

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology 19

Page 20: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Date Milestone

Aug 2011 Awarded 2-year Bridge Contract

Sept 2011 Previous WCOSS contract expired

Nov 2011 Awarded new WCOSS ID/IQ Contract

Apr 2012 Installed early access system & started early testing with new architecture

Sep 2012 Complete installation of Primary system (Reston, VA)

Nov 2012 Complete installation of Backup system (Orlando, FL)

Dec 2012 Acceptance of all systems completed

May2013 Achieve Authority to Operate (ATO) for new WCOSS (NOAA8866)

July 2013 Complete transition porting, testing, validation and tuning

Aug2013 Go operational on new WCOSS (Ops on new WCOSS Contract)

Sep 2013 Close out Bridge contract

Bridge Contract, WCOSS Contract and Transition from Bridge to new WCOSS Services

Previous Contract

Bridge Contract

New WCOSS Contract

FY12 FY11 FY13 FY14

Operations Transition activities Major Milestones

20 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 21: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Transition End-to-End Validation

Porting, Testing and Source Code Certification - $DEV Branch Chiefs certify

Production Validation - NCO/PMB

Branch certify

Customer Validation -

NCO Director certify

Development Organizations

NCEP Operations & Development Organizations

NCEP Operations & Development Organizations & Customers

- Using the WCOSS Transition Test Checklist & Certification Document:

- Part I:Approval for planned changes

- Part II: Evaluation and certification

- Vertical structure

- Re-compile - Integrate into

production scheduler - Tuning for resources and

timing - Post production/

product generation - Validate output - Downstream testing for

other software systems - Vertical structure

- Dataflow actions - Validate network

capacity - Notify customers - Receive customer

feedback - Make decisions and

changes based on customer feedback

Operational products equivalence

New baseline of software systems and applications

GO LIVE

21 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 22: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

SYSTEM DETAILS

WCOSS – NOAA Weather and Climate Operational Supercomputing System

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology 22

Page 23: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Phase 1 and 2 Software Overview

• Red Hat Linux OS • IBM xCAT • IBM Parallel Environment including runtime and developer editions • Platform Computing Load Sharing Facility (LSF) • ECMWF ecFlow Workflow Manager • IBM General Parallel Filesystem (GPFS) • Intel Cluster Studio • Rogue Wave TotalView Debugger • IDL • Community Supported Software, e.g., BUFR, GRIB, netCDF,

GEMPAK, GIS, GoogleMaps, GrADS, NCAR Graphics, VIS5D, Subversion

23 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 24: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

System Architecture

• IBM dx360M4 iDataPlex server based on the Intel Sandy Bridge Processor

• Clustered with a Mellanox InfiniBand FDR fabric • Cluster components include the compute nodes, login

nodes, high-performance interconnect, and the management infrastructure

• Configured with “hot spares” for high system availability • Storage based on the IBM DCS3700 disk subsystem and

GPFS • Includes services to address the implementation,

acceptance testing, and training for the transition from the current to new systems

24 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 25: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Phase 1 System Architecture

25 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 26: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Phase 2 System Architecture

26 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 27: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

WRAP-UP

WCOSS – NOAA Weather and Climate Operational Supercomputing System

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology 27

Page 28: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Transition Key Considerations

• Schedule constraint with high impact consequences – Transition before bridge contract expires in September 2013 – Current operational systems at peak load nearing end of life

• Resource dependencies external to investment

– Transition highly dependent on resources from the NOAA Data Assimilation & Modeling/Science investment

• Technical

– NOAA transitioning operational software systems/applications from legacy systems (Bridge) to a new/different architecture (WCOSS)

– Technical challenges with integration of IBM’s new hardware and system software architecture

28 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 29: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

Summary

• System growth curve delayed but resuming and accelerating (for now)

• Model implementations continued at the cost of system recoverability

• Contract architecture allows for agile response to budget changes

• System architecture designed with agility to match contract capability

29 1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology

Page 30: NCEP HPC Transition - ECMWF · NCEP HPC Transition 15 th ECMWF Workshop on the Use of HPC in Meteorology . Allan Darling . Deputy Director, NCEP Central Operations

QUESTIONS [email protected]

WCOSS – NOAA Weather and Climate Operational Supercomputing System

1 Oct 2012 15th ECMWF Workshop on the Use of HPC in Meteorology 30