
Page 1: 20120524 cern data centre evolution v2

CERN IT Department, CH-1211 Genève 23, Switzerland – www.cern.ch/it

CERN Infrastructure Evolution

Tim Bell

CHEP

24th May 2012

Page 2: 20120524 cern data centre evolution v2

Agenda

• Problems to address

• Ongoing projects
– Data centre expansion
– Configuration management
– Infrastructure as a Service
– Monitoring

• Timelines

• Summary


Page 3: 20120524 cern data centre evolution v2

Problems to address

• CERN data centre is reaching its limits

• IT staff numbers remain fixed but more computing capacity is needed

• Tools are high maintenance and becoming increasingly brittle

• Inefficiencies exist but root cause cannot be easily identified


Page 4: 20120524 cern data centre evolution v2

Data centre extension history

• Studies in 2008 into building a new computer centre on the CERN Prevessin site
– Too costly

• In 2011, tender run across CERN member states for remote hosting
– 16 bids

• In March 2012, Wigner Institute in Budapest, Hungary selected


Page 5: 20120524 cern data centre evolution v2

Data Centre Selection

• Wigner Institute in Budapest, Hungary


Page 6: 20120524 cern data centre evolution v2

Data Centre Layout & Ramp-up


Page 7: 20120524 cern data centre evolution v2

Data Centre Plans

• Timescales

– Prototyping in 2012

– Testing in 2013

– Production in 2014

• This will be a “hands-off” facility for CERN

– The Wigner team manages the infrastructure and hands-on work

– We do everything else remotely

• Need to be reasonably dynamic in our usage

– CERN will pay the power bill

• Various possible models of usage – sure to evolve

– The less specific the hardware installed there, the easier to change function

– Disaster Recovery for key services in Meyrin data centre becomes a realistic scenario


Page 8: 20120524 cern data centre evolution v2

Infrastructure Tools Evolution

• We had to develop our own toolset in 2002
• Nowadays,
– CERN compute capacity is no longer leading edge
– Many options are available for open source fabric management
– We need to scale to meet the upcoming capacity increase
• If a requirement is not available through an open source tool, we should question the need
– If we are the first to need it, contribute it back to the open source tool


Page 9: 20120524 cern data centre evolution v2

Where to Look?

• Large community out there taking the “tool chain” approach whose scaling needs match ours: O(100k) servers and many applications

• Become standard and join this community

Page 10: 20120524 cern data centre evolution v2

Bulk Hardware Deployment

• Adapt processes for minimal physical intervention

• New machines register themselves automatically into network databases and inventory

• Burn-in test uses standard Scientific Linux and production monitoring tools

• Detailed hardware inventory with serial number tracking to support accurate analysis of failures
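The registration step can be scripted end to end. Below is a minimal sketch of what a first-boot self-registration could look like; the inventory endpoint URL and the JSON field names are hypothetical, for illustration only, not the actual CERN network and inventory interfaces.

```python
#!/usr/bin/env python
# Minimal first-boot self-registration sketch.
# The endpoint URL and JSON fields are hypothetical, for illustration only.
import json
import socket
import subprocess
import urllib2

INVENTORY_URL = "https://inventory.example.ch/api/nodes"  # hypothetical service

def chassis_serial():
    """Read the chassis serial number via dmidecode (needs root)."""
    return subprocess.check_output(
        ["dmidecode", "-s", "system-serial-number"]).strip()

def register():
    record = json.dumps({
        "hostname": socket.getfqdn(),
        "serial": chassis_serial(),
    })
    request = urllib2.Request(INVENTORY_URL, record,
                              {"Content-Type": "application/json"})
    urllib2.urlopen(request)  # a production version would authenticate and retry

if __name__ == "__main__":
    register()
```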

Page 11: 20120524 cern data centre evolution v2

Current Configurations


• Many, diverse applications (“clusters”) managed by different teams
• ...and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system

Page 12: 20120524 cern data centre evolution v2

Job Adverts from Indeed.com


Index of millions of worldwide job posts across thousands of job sites

These are the sort of posts our departing staff will be applying for.

[Indeed.com job-trend comparison: Puppet vs. Quattor]

Page 13: 20120524 cern data centre evolution v2

Configuration Management

• Using off-the-shelf components
– Puppet – configuration definition
– Foreman – GUI and data store
– Git – version control
– Mcollective – remote execution
• Integrated with
– CERN Single Sign On
– CERN Certificate Authority
– Installation Server
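As an illustration of how these off-the-shelf pieces are driven programmatically, the sketch below lists the hosts a Foreman instance knows about through its JSON REST API. The URL, credentials and response layout are assumptions (the older API wraps each record as {"host": {...}}); this is a sketch, not the actual CERN integration.

```python
# Sketch: list hosts known to Foreman via its JSON REST API.
# URL, credentials and the response layout are assumptions for illustration.
import requests

FOREMAN = "https://foreman.example.ch"   # hypothetical Foreman instance
AUTH = ("apiuser", "secret")             # hypothetical credentials

response = requests.get(FOREMAN + "/api/hosts", auth=AUTH, verify=False)
response.raise_for_status()

for entry in response.json():
    # Older Foreman API versions wrap each record as {"host": {...}}.
    host = entry.get("host", entry)
    print(host["name"])
```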


Page 14: 20120524 cern data centre evolution v2

Status


Page 15: 20120524 cern data centre evolution v2

Different Service Models


• Pets are given names like lsfmaster.cern.ch

• They are unique, lovingly hand raised and cared for

• When they get ill, you nurse them back to health

• Cattle are given numbers like vm0042.cern.ch

• They are almost identical to other cattle

• When they get ill, you get another one

• Future application architectures tend towards Cattle
• ...but users now need support for both modes of working

Page 16: 20120524 cern data centre evolution v2

Infrastructure as a Service

• Goals
– Improve repair processes with virtualisation
– More efficient use of our hardware
– Better tracking of usage
– Enable remote management for the new data centre
– Support potential new use cases (PaaS, Cloud)
– Sustainable support model
• At scale for 2015
– 15,000 servers
– 90% of hardware virtualized
– 300,000 VMs needed
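A back-of-envelope check of what these 2015 targets imply for VM density per hypervisor, using only the numbers quoted above:

```python
# Back-of-envelope check of the implied VM density for the 2015 targets.
servers = 15000
virtualised_fraction = 0.90
vms_needed = 300000

hypervisors = servers * virtualised_fraction               # ~13,500 hypervisors
print("VMs per hypervisor: %.1f" % (vms_needed / hypervisors))   # ~22
```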


Page 17: 20120524 cern data centre evolution v2

OpenStack

• Open source cloud software
• Supported by 173 companies including IBM, RedHat, Rackspace, HP, Cisco, AT&T, …
• Vibrant development community and ecosystem
• Infrastructure as a Service at our scale
• Started in 2010 but maturing rapidly


Page 18: 20120524 cern data centre evolution v2

Openstack @ CERN

[Component diagram – OpenStack at CERN: NOVA (compute, scheduler, network, volume), GLANCE (image, registry), KEYSTONE (identity), HORIZON (dashboard)]
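For illustration, here is a minimal sketch of driving such a deployment with the python-novaclient library of that era: authenticate via Keystone, then list the flavours defined in Nova and the images registered in Glance. The credentials and endpoint are assumptions, not the actual CERN setup.

```python
# Sketch: authenticate via Keystone and list what Nova and Glance expose,
# using the python-novaclient library of the Essex/Folsom era.
# Credentials and the endpoint are assumptions for illustration.
from novaclient.v1_1 import client

nova = client.Client("username", "password", "tenant",
                     "https://keystone.example.ch:5000/v2.0/")

for flavor in nova.flavors.list():       # flavours defined in Nova
    print("flavor: %s" % flavor.name)

for image in nova.images.list():         # images registered in Glance
    print("image:  %s" % image.name)
```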


Page 19: 20120524 cern data centre evolution v2

Scheduling and Accounting

• Multiple uses of IaaS
– Server consolidation
– Classic batch (single or multi-core)
– Cloud VMs such as CERNVM
• Scheduling options
– Availability zones for disaster recovery
– Quality of service options to improve efficiency, such as build machines, public login services
– Batch system scalability is likely to be an issue
• Accounting
– Use underlying services of IaaS and hypervisors for reporting and quotas
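As a concrete illustration of the availability-zone option, the sketch below boots a stand-by copy of a service into a specific zone through the same client library as in the earlier sketch. The credentials, endpoint, image, flavour and the zone name "remote-dc" are all assumptions for illustration.

```python
# Sketch: ask the Nova scheduler to place a VM in a specific availability zone,
# e.g. so a stand-by copy of a key service runs in the remote data centre.
# Credentials, endpoint, names and the "remote-dc" zone are assumptions.
from novaclient.v1_1 import client

nova = client.Client("username", "password", "tenant",
                     "https://keystone.example.ch:5000/v2.0/")

image = nova.images.find(name="SLC6 Server")   # hypothetical image name
flavor = nova.flavors.find(name="m1.small")

server = nova.servers.create("lsfmaster-dr", image, flavor,
                             availability_zone="remote-dc")
print("%s scheduled into zone remote-dc" % server.id)
```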

Page 20: 20120524 cern data centre evolution v2

Monitoring in IT

• >30 monitoring applications
– Number of producers: ~40k
– Input data volume: ~280 GB per day
• Covering a wide range of different resources
– Hardware, OS, applications, files, jobs, etc.
• Application-specific monitoring solutions
– Using different technologies (including commercial tools)
– Sharing similar needs: aggregate metrics, get alarms, etc.
• Limited sharing of monitoring data
– Hard to implement complex monitoring queries
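For a sense of scale, an approximate back-of-envelope calculation on the figures quoted above (nothing assumed beyond those numbers):

```python
# Rough scale of the monitoring data flow quoted above.
producers = 40000            # ~40k producers
volume_gb_per_day = 280.0    # ~280 GB per day

print("Aggregate rate: %.1f MB/s" % (volume_gb_per_day * 1024 / 86400))   # ~3.3 MB/s
print("Per producer:   %.1f MB/day" % (volume_gb_per_day * 1024 / producers))  # ~7 MB/day
```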


Page 21: 20120524 cern data centre evolution v2

Monitoring Architecture


[Architecture diagram: producer/sensor nodes publish to a messaging broker; storage and operations consumers feed a storage and analysis engine and operations tools, with dashboards and APIs on top. Technology labels on the slide: Apollo (broker), Lemon, Hadoop, Splunk.]
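To make the producer-to-broker path concrete, here is a minimal sketch of what a producer might emit. The JSON field names and the destination are illustrative assumptions, not the real schema, and publish() merely stands in for whatever STOMP/AMQP client session talks to the Apollo broker.

```python
# Sketch of a monitoring producer: build a metric record and hand it to the broker.
# Field names, destination and the publish() transport are illustrative assumptions.
import json
import socket
import time

def make_metric(name, value):
    return json.dumps({
        "host": socket.getfqdn(),
        "metric": name,
        "value": value,
        "timestamp": int(time.time()),
    })

def publish(destination, body):
    # Stand-in for a STOMP/AMQP client session to the Apollo broker;
    # here we just show what would be sent.
    print("-> %s %s" % (destination, body))

publish("/topic/monitoring.metrics", make_metric("load.one_minute", 0.42))
```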

Page 22: 20120524 cern data centre evolution v2

Current Tool Snapshot (Liable to Change)

[Tool-chain diagram. Tools shown: Jenkins; Koji, Mock; Puppet, Foreman; AIMS/PXE, Foreman; Yum repo, Pulp; Puppet stored config DB; mcollective, yum; JIRA; Lemon; git, SVN; OpenStack Nova; hardware database]

Page 23: 20120524 cern data centre evolution v2

Timelines

• 2012 – Prepare formal project plan; establish IaaS in the CERN Data Centre; monitoring implementation as per WG; migrate lxcloud users; early adopters to use new tools
• 2013 (LS 1, new Data Centre) – Extend IaaS to the remote Data Centre; Business Continuity; migrate CVI users; general migration to new tools with SLC6 and Windows 8
• 2014 (LS 1, to November) – Phase out legacy tools such as Quattor


Page 24: 20120524 cern data centre evolution v2

Conclusions

• New data centre provides opportunity to expand the Tier-0 capacity

• New tools are required to manage these systems and improve processes given fixed staff levels

• Integrated monitoring is key to identify inefficiencies, bottlenecks and usage

• Moving to an open source tool chain has allowed a rapid proof of concept and will be a more sustainable solution


Page 25: 20120524 cern data centre evolution v2

Background Information

• HEPiX Agile Talks – http://cern.ch/go/99Ck
• Tier-0 Upgrade – http://cern.ch/go/NN98

For other information, get in touch:

[email protected]


Page 26: 20120524 cern data centre evolution v2

Backup Slides


Page 27: 20120524 cern data centre evolution v2

CERN Data Centre Numbers


• Systems: 7,899
• Processors: 14,972
• Cores: 64,623
• Memory (TiB): 165
• Racks: 1,070
• Power consumption (kW): 2,345
• Hard disks: 62,023
• Raw disk capacity (TiB): 62,660
• Tape capacity (PiB): 47
• 1 Gb Ethernet ports: 16,773
• 10 Gb Ethernet ports: 622

From http://sls.cern.ch/sls/service.php?id=CCBYNUM

Page 28: 20120524 cern data centre evolution v2

Computer Centre Capacity


Page 29: 20120524 cern data centre evolution v2

Capacity to end of LEP


Page 30: 20120524 cern data centre evolution v2

Foreman Dashboard


Page 31: 20120524 cern data centre evolution v2


Industry Context - Cloud

• Cloud computing models are now standardising
– Facilities as a Service – such as Equinix, Safehost
– Infrastructure as a Service – Amazon EC2, CVI or lxcloud
– Platform as a Service – Microsoft Azure or CERN Web Services
– Software as a Service – Salesforce, Google Mail, Service-Now, Indico
• Different customers want access to different layers
– Both our users and the IT Service Managers


[Layer diagram: Applications on top of Platform, on top of Infrastructure, on top of Facilities]

Page 32: 20120524 cern data centre evolution v2

SLC 4 to SLC 5 migration
