NCEP HPC Transition
15th ECMWF Workshop on the Use of HPC in Meteorology
Allan Darling, Deputy Director, NCEP Central Operations
CURRENT OPERATIONAL CHALLENGE
WCOSS – NOAA Weather and Climate Operational Supercomputing System
1 Oct 2012, 15th ECMWF Workshop on the Use of HPC in Meteorology
[Chart] NCEP Operational Forecast Skill: 36- and 72-hour forecasts at 500 mb over North America, 1955-2015, scored by the 100 * (1 - S1/70) method. Skill improves steadily across successive operational systems: IBM 701, IBM 704, IBM 7090, IBM 7094, CDC 6600, IBM 360/195, CYBER 205, CRAY Y-MP, CRAY C90, IBM SP, IBM P690, IBM P655+, IBM Power 6.
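The bracketed formula in the chart caption above is the whole scoring rule: an S1 error of 70 maps to zero skill, and lower S1 maps linearly to higher skill. A minimal sketch of the transformation (illustrative only, not NCEP's operational verification code):

```python
def forecast_skill(s1: float) -> float:
    """Convert an S1 error score to the plotted skill metric, 100 * (1 - S1/70).

    S1 measures relative error in the forecast horizontal pressure gradient;
    by the long-standing convention behind this chart, S1 = 70 is treated as
    a worthless forecast (skill 0) and smaller S1 means a better forecast.
    """
    return 100.0 * (1.0 - s1 / 70.0)

# A forecast with S1 = 35 scores exactly half of the maximum skill:
print(forecast_skill(35.0))  # 50.0
```

The linear rescaling is what lets decades of 36- and 72-hour forecasts be plotted on one 0-100 axis.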
NCEP HPC Growth
[Chart] TeraFLOPs per operational system, log scale, 2000-2012: 1.2, 1.8, 4.4, 14.0, 15.5, 69.7, 73.9 TF.
[Chart] HPC Production Utilization, February 2010: system load sampled over a 24-hour cycle, with the production high-water mark reached on 02/16/2010.
[Chart] HPC Production Utilization, 08/27/2012: 81.1% utilized (69.77% production / 11.33% development) over a 24-hour cycle.
[Image slide: May 2012]
FY11 On-Time Product Generation
[Chart] Daily on-time product generation percentage (99-100% scale), Oct 2010 - Jul 2011, with annotated outage events.
99.73% average on-time product generation, Oct 2010 - Jul 2011
10/6/2010 and 10/7/2010: Job failures
10/27/2010: Storage failure
2/16/2011: Failed node
FY12 On-Time Product Generation
[Chart] Daily on-time product generation percentage (99-100% scale), Oct 2011 - Aug 2012, with annotated outage events.
99.35% average on-time product generation, Oct 2011 - Aug 2012
10/18/2011 & 10/19/2011: Implementation error
11/10/2011: Facility power outage, requiring emergency system switch
12/24/2011 & 12/25/2011: Model failure
2/1/2012: DNS issue
2/6/2012: Hardware issue, leading to emergency system switch
2/7/2012 - 2/9/2012: System performance issue
2/14/2012 & 2/15/2012: Implementation error
2/16/2012: Hardware issue
2/18/2012: System hardware failure
3/20/2012: Hardware issue
5/13/2012: Storage failure and emergency system switch
5/14/2012: Job failure
6/9/2012: Model failure
8/1/2012: Power loss at facility and emergency system switch
8/16/2012 & 8/17/2012: Hardware repair issue
8/28/2012 & 8/29/2012: System management loading issue; impact worsened due to system load
CONTRACT & SYSTEM OVERVIEW
WCOSS – NOAA Weather and Climate Operational Supercomputing System
WCOSS Contract Overview
• Sustains current operational systems during build, acceptance, and transition to the new systems; the Bridge contract expires September 2013
• One 5-year base period, one 3-year option period, and one 2-year transition option period
• Provides phased delivery of new Primary and Backup systems in the 5-year base period; will transition from IBM/Power6/AIX to IBM/Intel/Linux by the end of FY13
• Provides new sites: Primary in Reston, VA; Backup in Orlando, FL
Bridge Contract: Current NOAA Weather and Climate Operational Supercomputer System

Significance
• Where the United States weather forecast process starts, for the protection of lives and livelihoods
• Produces model guidance at global, national, and regional scales; examples: hurricane forecasts, aviation/transportation, air quality, fire weather
• Contract ensures no gap in operations

Location
• Primary: Gaithersburg, MD (IBM-provided facility)
• Backup: Fairmont, WV (GFE NASA IV&V facility)

Configuration (identical systems per site)
• IBM Power6/P575/AIX
• 73.9 trillion calculations per second
• 5,314 processing cores
• 0.8 petabytes of storage

Performance Requirements
• Minimum 99.0% operational use time
• Minimum 99.0% on-time product generation
• Minimum 99.0% development use time
• Minimum 99.0% system availability
• Failover tested regularly

Inputs and Outputs
• Processes 3.5 billion observations per day
• Produces over 15 million products per day
WCOSS Contract: Future NOAA Weather and Climate Operational Supercomputer System

Significance
• Where the United States weather forecast process starts, for the protection of lives and livelihoods
• Produces model guidance at global, national, and regional scales; examples: hurricane forecasts, aviation/transportation, air quality, fire weather

Location
• Primary: Reston, VA (IBM-provided facility)
• Backup: Orlando, FL (IBM-provided facility)

Configuration (identical systems per site)
• IBM iDataPlex/Intel Sandy Bridge/Linux
• 208 trillion calculations per second
• 10,048 processing cores
• 2.59 petabytes of storage

Performance Requirements
• Minimum 99.9% operational use time
• Minimum 99.0% on-time product generation
• Minimum 99.0% development use time
• Minimum 99.0% system availability
• Failover tested regularly

Inputs and Outputs
• Processes 3.5 billion observations per day
• Produces over 15 million products per day
Key Performance Metrics
• Bridge: at least 99.0% operational use time
• WCOSS: at least 99.9% operational use time
• WCOSS and Bridge:
  – At least 99.0% system availability
  – Sustained on-time generation of model guidance products at a rate of 99% or better, within 15 minutes of target completion times
  – No degradation in product delivery during planned, routine failover testing between the Primary and Backup WCOSS
  – Delivery of committed computing capability upgrades objectively measured and accepted through use of a benchmark suite
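The on-time metric above reduces to a timestamp comparison: a product counts as on time if it completes within 15 minutes of its target time, and the requirement is that at least 99% of products do. A hypothetical sketch (the data layout and example times are illustrative, not actual NCEP product records):

```python
from datetime import datetime, timedelta

ON_TIME_WINDOW = timedelta(minutes=15)

def on_time_rate(products):
    """Percentage of products delivered within 15 minutes of target.

    `products` is a list of (target, actual) completion-time pairs.
    """
    on_time = sum(1 for target, actual in products
                  if actual <= target + ON_TIME_WINDOW)
    return 100.0 * on_time / len(products)

# Four illustrative product cycles on one day:
day = [
    (datetime(2012, 8, 27, 3, 30),  datetime(2012, 8, 27, 3, 38)),   # on time
    (datetime(2012, 8, 27, 9, 30),  datetime(2012, 8, 27, 9, 44)),   # on time (14 min late)
    (datetime(2012, 8, 27, 15, 30), datetime(2012, 8, 27, 15, 50)),  # late (20 min)
    (datetime(2012, 8, 27, 21, 30), datetime(2012, 8, 27, 21, 30)),  # on time
]
print(on_time_rate(day))  # 75.0
```

At the real volume of over 15 million products per day, even the 99% floor leaves room for tens of thousands of late products, which is why the FY11/FY12 charts track the daily rate so closely.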
WCOSS Locations
System Characteristics (per site)

Bridge System
• Life cycle: Oct 2012 - Sep 2013
• Architecture/OS: Power6, AIX 5.3
• Average capability: 1.0X; average capacity: 1.0X
• 5,314 cores; 73.9 TF

WCOSS Phase 1
• Life cycle: planned acceptance Dec 2012, FOC Aug 2013
• Architecture/OS: iDataPlex, Linux (RHEL)
• Average capability: *2.0X over Bridge P6; average capacity: *2.3X over Bridge P6
• 10,048 cores; 208 TF

WCOSS Phase 2
• Life cycle: planned acceptance Dec 2014, FOC Jul 2015
• Architecture/OS: iDataPlex, Linux (RHEL)
• Average capability: *1.9X over Phase 1; average capacity: *4.4X over Phase 1
• *44,400 cores; *920 TF

*Estimated values; assumes FY13 NWS Reallocation Budget
System Characteristics - cont. (per site)

Bridge System
• Life cycle: Oct 2012 - Sep 2013
• Operational use time requirement: 99%
• Usable storage: 0.80 PB; power: 689 kW; floor space: 3,100 sq ft

WCOSS Phase 1
• Life cycle: planned acceptance Dec 2012, FOC Aug 2013
• Operational use time requirement: 99.9%
• Usable storage: 2.59 PB; power: 469 kW; floor space: 4,060 sq ft

WCOSS Phase 2
• Life cycle: planned acceptance Dec 2014, FOC Jul 2015
• Operational use time requirement: 99.9%
• Usable storage: *7.2 PB; power: *1,050 kW; floor space: 4,060 sq ft

*Estimated values; assumes FY13 NWS Reallocation Budget
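One derived comparison worth noting from the characteristics tables: peak TeraFLOPs per kilowatt, a rough energy-efficiency measure. A back-of-the-envelope sketch using only the slide figures (note this uses peak TF, not the benchmarked capability/capacity multiples quoted above):

```python
# Peak TF and power draw per site, taken from the characteristics tables.
systems = {
    "Bridge (Power6)": {"tf": 73.9, "kw": 689},
    "WCOSS Phase 1":   {"tf": 208.0, "kw": 469},
}

for name, s in systems.items():
    print(f"{name}: {s['tf'] / s['kw']:.3f} TF/kW")

# Phase 1 delivers ~2.8x the peak TF of the Bridge system while drawing
# ~0.68x the power -- roughly a 4x improvement in TF per kW.
ratio = (208.0 / 469) / (73.9 / 689)
print(round(ratio, 1))  # 4.1
```

That efficiency jump (Power6/AIX to Intel Sandy Bridge/Linux) is what lets Phase 1 fit a larger machine into a smaller power envelope at the new Reston and Orlando sites.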
WCOSS HPC Growth
[Chart] TeraFLOPs per operational system, log scale, 2000-2015 (2013-2015 projected): 1.2, 1.8, 4.4, 14.0, 15.5, 69.7, 73.9, 208, 920 TF.
Built into the contract are capability and storage increases approximately every 3 years with no increase in funding. NOAA modeling plans rely heavily on these "built-in" contractual increases.
SCHEDULE & TRANSITION
WCOSS – NOAA Weather and Climate Operational Supercomputing System
Date Milestone
Aug 2011 Awarded 2-year Bridge Contract
Sept 2011 Previous WCOSS contract expired
Nov 2011 Awarded new WCOSS ID/IQ Contract
Apr 2012 Installed early access system & started early testing with new architecture
Sep 2012 Complete installation of Primary system (Reston, VA)
Nov 2012 Complete installation of Backup system (Orlando, FL)
Dec 2012 Acceptance of all systems completed
May 2013 Achieve Authority to Operate (ATO) for new WCOSS (NOAA8866)
July 2013 Complete transition porting, testing, validation and tuning
Aug 2013 Go operational on new WCOSS (operations begin under the new WCOSS contract)
Sep 2013 Close out Bridge contract
Bridge Contract, WCOSS Contract, and Transition from Bridge to New WCOSS Services
[Timeline chart, FY11-FY14: Previous Contract, then Bridge Contract, then New WCOSS Contract, with bars marking operations, transition activities, and major milestones]
Transition End-to-End Validation

Three certification stages, each with its own certifying authority, all governed by the WCOSS Transition Test Checklist & Certification Document (Part I: approval for planned changes; Part II: evaluation and certification):

1. Porting, Testing and Source Code Certification (Development Organizations; $DEV Branch Chiefs certify)
   – Vertical structure
   – Re-compile
   – Integrate into production scheduler
   – Tuning for resources and timing

2. Production Validation (NCEP Operations & Development Organizations; NCO/PMB Branch certifies)
   – Post-production / product generation
   – Validate output
   – Downstream testing for other software systems
   – Vertical structure
   – Dataflow actions

3. Customer Validation (NCEP Operations & Development Organizations & Customers; NCO Director certifies)
   – Validate network capacity
   – Notify customers
   – Receive customer feedback
   – Make decisions and changes based on customer feedback

Outcome: operational products equivalence and a new baseline of software systems and applications, then GO LIVE.
SYSTEM DETAILS
WCOSS – NOAA Weather and Climate Operational Supercomputing System
Phase 1 and 2 Software Overview
• Red Hat Linux OS
• IBM xCAT
• IBM Parallel Environment, including runtime and developer editions
• Platform Computing Load Sharing Facility (LSF)
• ECMWF ecFlow workflow manager
• IBM General Parallel File System (GPFS)
• Intel Cluster Studio
• Rogue Wave TotalView debugger
• IDL
• Community-supported software, e.g., BUFR, GRIB, netCDF, GEMPAK, GIS, Google Maps, GrADS, NCAR Graphics, Vis5D, Subversion
System Architecture
• IBM dx360 M4 iDataPlex servers based on the Intel Sandy Bridge processor
• Clustered with a Mellanox InfiniBand FDR fabric
• Cluster components include the compute nodes, login nodes, high-performance interconnect, and management infrastructure
• Configured with "hot spares" for high system availability
• Storage based on the IBM DCS3700 disk subsystem and GPFS
• Includes services to address implementation, acceptance testing, and training for the transition from the current to the new systems
Phase 1 System Architecture
Phase 2 System Architecture
WRAP-UP
WCOSS – NOAA Weather and Climate Operational Supercomputing System
Transition Key Considerations
• Schedule constraint with high-impact consequences
  – Transition must complete before the Bridge contract expires in September 2013
  – Current operational systems are at peak load and nearing end of life
• Resource dependencies external to the investment
  – Transition is highly dependent on resources from the NOAA Data Assimilation & Modeling/Science investment
• Technical
  – NOAA is transitioning operational software systems/applications from legacy systems (Bridge) to a new/different architecture (WCOSS)
  – Technical challenges with integration of IBM's new hardware and system software architecture
Summary
• System growth curve delayed but resuming and accelerating (for now)
• Model implementations continued at the cost of system recoverability
• Contract architecture allows for agile response to budget changes
• System architecture designed with agility to match contract capability
QUESTIONS [email protected]
WCOSS – NOAA Weather and Climate Operational Supercomputing System