ScotGrid

EGI CF 2013

Building a grid cluster from the ground up

A Tale of Two Rooms

Introduction

• ScotGrid Glasgow [GridPP]

• One of the largest Tier 2 sites in the UK NGI

• 4136 cores

• 1.3 PB online storage

A year on the grid

• Power & A/C outages from multiple causes on different scales

• Trips/larger substation drops etc.

• Lessons learned - general good practice

• General thoughts on living with a cluster

The case of two machine rooms

• Different ages of rooms - repurposing

• Different cooling solutions

• Advantages and disadvantages

• In principle, with redundant links, could have cluster redundancy

• In reality, complexity from bridging the cluster with that redundancy - where are the bottlenecks?

Site diagram

[Figure: network layout across the Upper and Lower machine rooms. Labels recovered from the diagram: Summit X670V switches, X460-48t switches, an 80 Gb/s link, worker nodes, servers and disk (including 10G worker nodes, 10G servers and 10G disk), and a WAN connection. Link legend: 10 Gb/s, 1 Gb/s multiple, 10 Gb/s multiple.]

Power & A/C failures

• Can happen to anyone

• Expect failure (like the grid philosophy)

• UPSes are very useful

• Except when they’re not

• Complexities of multi-room cluster

Best case

• One large data centre with ample power, cooling and network infrastructure

• Lower maintenance overheads

• Higher production uptime

• Failure prediction and multiple redundancy

Reality

• Many clusters grow organically over time, even with careful planning

• Periodic capacity upgrades can lead to infrastructure difficulties

Essential Cluster Infrastructure

• Power

• Cooling

• Network

Power

• Clusters are not desktops - be mindful of total power draw

• Potential for a mix of 3-phase & 13 A ring main

• Most likely to impact overall user environment if changes have to be made (whole building outages)

• Don’t mix phases within a rack

• Make it clear which phases are where
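
The points about total power draw and phases can be checked mechanically. The following is a minimal sketch, assuming a hypothetical inventory in which each rack is fed from exactly one phase and has an estimated draw in watts. The rack names, phase labels and wattages are invented for illustration and are not ScotGrid figures.

```python
from collections import defaultdict

# Hypothetical inventory: rack -> (phase, estimated draw in watts).
# One phase per rack, following the "don't mix phases within a rack" rule.
RACKS = {
    "rack01": ("L1", 6000),
    "rack02": ("L2", 5500),
    "rack03": ("L3", 6200),
    "rack04": ("L1", 4800),
}


def load_per_phase(racks):
    """Sum the estimated draw on each phase."""
    totals = defaultdict(int)
    for phase, watts in racks.values():
        totals[phase] += watts
    return dict(totals)


def imbalance(totals):
    """Spread between the most and least loaded phase, as a fraction of the mean."""
    mean = sum(totals.values()) / len(totals)
    return (max(totals.values()) - min(totals.values())) / mean


totals = load_per_phase(RACKS)
print(totals)
print(f"total draw: {sum(totals.values())} W, imbalance: {imbalance(totals):.0%}")
```

Keeping a table like `RACKS` next to the rack labels is one way to make it clear which phases are where, and the totals show when a new rack would overload one phase.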

Cooling

• Mix of techniques (in our case)

• Compressors - gradual degradation

• 4 AHUs: 4 x 2 compressors

• Liquid cooling

• 3 AHUs: effectively 1 active chiller (with failover)

• Over-specification

• Expect maintenance downtime
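
One way to reason about over-specification is to ask whether the room still copes with some units out for maintenance or failed. The sketch below assumes each cooling unit has a known capacity in kW and that the room has a known heat load. The capacity of 15 kW per compressor and the 100 kW load are hypothetical numbers chosen for illustration; only the count of 4 x 2 compressors comes from the slide.

```python
def survives_loss(unit_capacities_kw, heat_load_kw, units_lost=1):
    """Return True if the remaining capacity still covers the heat load
    after the `units_lost` largest units are taken out of service."""
    remaining = sorted(unit_capacities_kw)
    if units_lost:
        remaining = remaining[:-units_lost]  # drop the largest units first (worst case)
    return sum(remaining) >= heat_load_kw


# Hypothetical room: 4 AHUs x 2 compressors, 15 kW each, 100 kW heat load.
compressors = [15.0] * 8
heat_load = 100.0

for lost in range(4):
    print(lost, "lost ->", "ok" if survives_loss(compressors, heat_load, lost) else "overheats")
```

With these assumed numbers the room tolerates one compressor out of service but not two, which is the kind of margin worth knowing before scheduling maintenance.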

Network

• An aside (not power or A/C)

• Networking now a first class citizen

• Disparate vendors -> Unified structure

• 160 Gbps backbone

• 80 Gbps redundant ring

Site diagram (shown again)

[Figure: the site network diagram from earlier in the talk, repeated here: Summit X670V and X460-48t switches, worker nodes, servers and disk in the Upper and Lower rooms, an 80 Gb/s link and a WAN connection.]

Best practices

• Cold starts & boot order

• Auto power on?

• Alerts for sysadmins

• Notifications & communication

• Single points of failure - startup critical path
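
Single points of failure on the startup path can be found by asking, for each service, how much of the cluster cannot come up without it. The sketch below assumes a hypothetical dependency map with hard dependencies and no redundant alternatives. The service names and edges are illustrative and are not the ScotGrid layout.

```python
# Hypothetical dependency map: each service lists what must be up before it.
DEPENDS_ON = {
    "core-switch": [],
    "dns": ["core-switch"],
    "nfs": ["core-switch", "dns"],
    "batch-master": ["nfs", "dns"],
    "storage-head": ["dns"],
    "worker-nodes": ["batch-master", "nfs"],
}


def blocked_by(service, depends_on):
    """Every service that cannot start, directly or indirectly, without `service`."""
    blocked = set()
    frontier = [service]
    while frontier:
        current = frontier.pop()
        for name, deps in depends_on.items():
            if current in deps and name not in blocked:
                blocked.add(name)
                frontier.append(name)
    return blocked


# Rank services by how much of the cluster they hold up.
ranking = sorted(DEPENDS_ON, key=lambda s: len(blocked_by(s, DEPENDS_ON)), reverse=True)
for service in ranking:
    print(f"{service}: blocks {len(blocked_by(service, DEPENDS_ON))} other service(s)")
```

Services at the top of the ranking are the ones that most need alerts, spares, or a documented manual recovery step.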

Startup procedures

• Critical path

• Core infrastructure

• Core services

• (NFS Master Services pool nodes DPM WN)

• More speed less haste

• Automation

• Cluster management
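
A cold-start order like the one above can be derived automatically from a dependency map instead of being kept in someone's head. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to group hosts into boot waves. The node names loosely follow the slide, but the `network` entry and all of the dependency edges are assumptions made for illustration and are not the site's actual procedure.

```python
from graphlib import TopologicalSorter

# Hypothetical map: node -> set of nodes that must be up first.
DEPENDS_ON = {
    "network": set(),
    "nfs": {"network"},
    "master": {"nfs"},
    "services": {"master"},
    "pool-nodes": {"network"},
    "dpm": {"pool-nodes", "services"},
    "wn": {"services", "dpm"},
}


def boot_waves(depends_on):
    """Group nodes into waves; everything in one wave can be powered on together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as up before asking for the next
    return waves


for number, wave in enumerate(boot_waves(DEPENDS_ON), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Waves make the critical path visible: the number of waves is the minimum number of sequential steps, and anything that shares a wave can be started in parallel.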

A user’s perspective

• Depending on the size of the cluster, power and A/C concerns can have a major impact on users.

• Communication

• Notification

• Posted maintenance windows

• Postmortem

Process flow

• Logging

• Preventative maintenance

• Event flow

• Postmortem

• Process revision

• Escalation

Summary

• Cluster environment is very often externally dictated

• Organic growth

• Can happen to anyone

• Process