resource management: assessing the warm water facilities

18
ORNL is managed by UT-Battelle, LLC for the US Department of Energy Resource Management: Assessing the Warm Water Facilities and Systems Supporting ORNL’s Summit Jim Rogers Director, Computing and Facilities National Center for Computational Sciences Oak Ridge National Laboratory

Upload: others

Post on 24-Jan-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Resource Management: Assessing the Warm Water Facilities

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Resource Management: Assessing the Warm Water Facilities and Systems Supporting ORNL’s SummitJim Rogers

Director, Computing and FacilitiesNational Center for Computational SciencesOak Ridge National Laboratory

Page 2: Resource Management: Assessing the Warm Water Facilities

22

ORNL Leadership Class Systems 2004 - 2018

2012Cray XK7

Titan

27PF

18.5TF

25 TF

54 TF

62 TF

263 TF

1 PF

2.5PF

2004Cray X1E Phoenix

2005Cray XT3

Jaguar

2006Cray XT3

Jaguar

2007Cray XT4

Jaguar

2008Cray XT4

Jaguar

2008Cray XT5

Jaguar

2009Cray XT5

Jaguar

From 2004 – 2018, HPC systems relied on chiller-based cooling (5.5°C / 42°F

supply) with annualized PUEs to ~1.4

Page 3: Resource Management: Assessing the Warm Water Facilities

33

ORNL’s Transition to Warmer Facility Supply Temperatures

~1.5EF

200PF

27PF

2012Cray XK7

Titan

2021/2022HPE/Cray Frontier

2018/2019IBM

Summit

Titan: Refrigerant-based per-rack cooling with direct rejection of heat to cold 5.5°C water• Below dewpoint• 100% use of chillers

Summit: A combination of direct on-package cooling and RDHX with 21°C / 70°Fsupply is > 95% room-neutral.• Above dewpoint• Contribution by chillers

~20% of the hours of the year

Frontier: Custom packaging is >95% room-neutral with a 32°C / 90°F supply.• ~100% Evaporative Cooling,

with supplemental HVAC for parasitic loads

Page 4: Resource Management: Assessing the Warm Water Facilities

4

Motivation for “Warmer” Cooling Solutions Serving HPC Centers

• Reduced Cost, Both CAPEX and OPEX– Reduce or eliminate the need for traditional

chillers. • No chillers, no ozone-depleting refrigerants (GREEN)

– Oak Ridge calculates an annualized PUE for air--cooled devices of no better than 1.4 (ASHRAE Zone 4A – Mixed Humid)

Oak Ridge, TN

– HPC power budgets continue to grow – Summit has a design point for >12MW (HPC-only). Minimizing PUE/ITUE is critical to the budget.

• Easier, more reliable design– Design is reduced to pumps, evaporative cooling, heat exchangers.– Traditional chilled water may not be necessary at all (NREL, NCAR, et al)

Page 5: Resource Management: Assessing the Warm Water Facilities

55

OLCF Facilities Supporting Summit• Titan – 9MW @ heavy

load• Sitting on 250

pounds/ft2 raised floor• Uses 42F water and

special CDUs (XDPs)

• Summit – 256 compute cabinets on-slab

• 100% room-neutral design uses RDHX

• 20MW warm-water cooling plant using centralized CDU/secondary loop

Page 6: Resource Management: Assessing the Warm Water Facilities

66

• Summit– Demand: 3.3 (idle)-11.5MW;

• Secondary Loop– Supply 3300GPM (12,500

liters/min) @ 21°C; Return @ 29-33°C

– CPUs and GPUs use cold plates– DIMMs and parasitic loads use

RDHX– Storage and Network use RDHX

21°C/20MW/7700 ton Facility System Design System

Page 7: Resource Management: Assessing the Warm Water Facilities

77

Medium Water Temperature Cooling DesignPrimary Loop uses Evaporative Cooling Towers (~80% of the hours of the year)

When the MTW RETURN is above the 21C setpoint, use a second set of Trim HX (with 5.5C on the other side) to drive MTW to the 21C setpoint.

The need for the trim-loop is about 20% of the hours in the year, and can ramp 0-100% to meet the setpoint back to Summit

A New Challenge – Titan and rotated off-line, so there is no significant load on the Chillers[1-5]. NOAA moved to the warm water loop (~500GPM demand)

Existing chilled water cooling loop

New primarycooling loop

Page 8: Resource Management: Assessing the Warm Water Facilities

88

Page 9: Resource Management: Assessing the Warm Water Facilities

99

Page 10: Resource Management: Assessing the Warm Water Facilities

10

Benefits of ORNL’s Warm-Water (21°C) Mechanical Plant

Warm Water• Total Capacity to manage 20MW of IT

load

• Warm Water allows annualized PUE of 1.1– For each ~$1M cost per MW-year for

consumption on Summit;– A corresponding ~$100k cost per MW-year

for waste heat management

• Titan was $6-7M/year plus $1.8-$2M to remove the waste heat

• Summit will have a similar “power bill”, but closer to $500k annually for waste heat.

Integrated System/Facility Operation• Integration with the PLC allows us to tune

water pressure and flow– Better delta(t); less pumping energy

• Integration with IBM’s OpenBMC provides information necessary to protect ~37K CPUs and GPUs from inadequate flow across the cold plates

• Integration with the scheduler allows us to correlate power and temperature data with individual applications

Page 11: Resource Management: Assessing the Warm Water Facilities

11

Summit’s Power Demand, Aug 2018– Aug 2019

Early Access, Gordon Bell, Acceptance Testing

Transition to Operations

Full Production

Page 12: Resource Management: Assessing the Warm Water Facilities

12

Detailed Analysis of Summit’s Annual PUE

1.00

1.05

1.10

1.15

1.20

1.25

1.30

0

2000

4000

6000

8000

10000

12000

14000

2018

/08/

0120

18/0

8/11

2018

/08/

2220

18/0

9/02

2018

/09/

1220

18/0

9/23

2018

/10/

0420

18/1

0/14

2018

/10/

2520

18/1

1/05

2018

/11/

1620

18/1

1/26

2018

/12/

0720

18/1

2/18

2018

/12/

2820

19/0

1/08

2019

/01/

1920

19/0

1/30

2019

/02/

0920

19/0

2/20

2019

/03/

0320

19/0

3/13

2019

/03/

2420

19/0

4/04

2019

/04/

1520

19/0

4/25

2019

/05/

0620

19/0

5/17

2019

/05/

2720

19/0

6/07

2019

/06/

1820

19/0

6/28

2019

/07/

0920

19/0

7/20

2019

/07/

3120

19/0

8/10

2019

/08/

21

PUE

Cool

ing

(kW

cm)

Date

Cooling Source August 1 2018-August 31 2019 With 1 Week PUE

CHW CTW 1 Week PUE (kW)

Seasonal Transition (Summer to Fall)

Winter/Spring: 100% Evaporative Cooling

Seasonal Transition (to Summer)

Page 13: Resource Management: Assessing the Warm Water Facilities

1313

Summit’s RM Challenge – Data Volume

13

TOTAL ~470,000 metrics per second available for real-time analytics

• Data Streams Include:

• IBM OpenBMC framework (99 metrics/node/second x 4608 nodes = ~460,000/sec) - OOB

• IBM LSF jobs data for running applications (~10sec update interval)

• NOAA weather/wet bulb for Oak Ridge area (continuous, external)

• Programmable Logic Circuit (PLC) water flow (continuous, protected as part of BAS)

Page 14: Resource Management: Assessing the Warm Water Facilities

14

Sample from Grafana Dashboard (live version shown)

Page 15: Resource Management: Assessing the Warm Water Facilities

15

MTW Performance

Summit (System) Impact:• No substantive impact to CPU or GPU

operating conditions (junction temperatures)

• One degree increase in room temperature (worst case about 72.5° F)

Year over Year, 2018 to 2019:• IT Demand (kW-h) is up 19%• CHW use is 39% LESS than in 2018• The one-degree adjustment of supply

temperature provided a savings of $40,000

2018:Supply 70°F & ~4500GPM 2019:Supply 71°F & ~3300GPM

Energy used by Chilled Water

Total Energy, Mechanical Plant

PUE Improves

Summit IT Load (kW-h) (Cumulative) Increases

Wet Bulb

Page 16: Resource Management: Assessing the Warm Water Facilities

1616

Summit Power/Jobs Every Second

Click toPlayMovie

Page 17: Resource Management: Assessing the Warm Water Facilities

1717

Summit Temperature/Jobs Every Second

Click toPlayMovie

Page 18: Resource Management: Assessing the Warm Water Facilities

1818