Virtualisation, Clouds and IaaS at CERN
Helge Meinhard, CERN IT Department
VTDC, Delft (NL), 19 June 2012
CERN IT Department
CH-1211 Genève 23, Switzerland
www.cern.ch/it
Outline
• Introduction
  – CERN
  – Physics at the LHC
  – LHC machine and detectors
  – Data processing challenges
  – WLCG
  – CERN computer centre
• Past and present (phase I): CERN virtualisation infrastructure, service consolidation, lxcloud
• Present and future (phase II): remote data centre, new tool suite, IaaS
Virtualisation, clouds, IaaS - June 2012 - Helge Meinhard
[Figure: length scales from the proton and atom up to the radius of the Earth, the Earth-Sun distance, galaxies and the Universe, alongside instruments such as Hubble, ALMA, VLT and AMS. Studying the physics laws of the first moments after the Big Bang: an increasing symbiosis between particle physics, astrophysics and cosmology. The LHC as a super-microscope.]
Enter a New Era in Fundamental Science
The Large Hadron Collider (LHC), one of the largest and truly global scientific projects ever, is the most exciting turning point in particle physics.
• Exploration of a new energy frontier
• LHC ring: 27 km circumference
[Figure: the LHC ring with its four detectors – CMS, ALICE, LHCb and ATLAS]
The LHC Computing Challenge
• Signal/noise: 10^-13 (10^-9 offline)
• Data volume
  – High rate × large number of channels × 4 experiments
  – 22 petabytes of new data each year
• Compute power
  – Event complexity × number of events × thousands of users
  – 200k CPUs, 45 PB of disk storage
• Worldwide analysis & funding
  – Computing funded locally in major regions & countries
  – Efficient analysis everywhere
  – GRID technology
Worldwide LHC Computing Grid
• Tier 0: CERN
  – Data acquisition and initial processing
  – Data distribution
  – Long-term curation
• Tier 1: 11 major centres
  – Managed mass storage
  – Data-heavy analysis
  – Dedicated 10 Gbps lines to CERN
• Tier 2: more than 200 centres in more than 30 countries
  – Simulation
  – End-user analysis
• Tier 3: from physicists’ desktops to small workgroup clusters
  – Not covered by MoU
[Figure: the tiered WLCG topology – CERN Tier 0 linked to Tier 1 centres (Germany, USA, UK, France, Italy, Spain, Taiwan, Nordic countries, Netherlands), each serving Tier 2 labs and universities; grids for regional and physics study groups; Tier 3 down to physics-department desktops]
The CERN Data Centre in Numbers
• Data centre operations (Tier 0)
  – 24x7 operator support and system administration services to support 24x7 operation of all IT services
  – Hardware installation & retirement
    • ~7,000 hardware movements/year; ~1,800 disk failures/year
  – Management and automation framework for large-scale Linux clusters
[Chart: disk drives by vendor – Western Digital 59%, Hitachi 23%, Seagate 15%, Fujitsu 3%, HP/Maxtor/other <1%]
High-speed routers (640 Mbps → 2.4 Tbps): 24
Ethernet switches: 350
10 Gbps ports: 2,000
Switching capacity: 4.8 Tbps
1 Gbps ports: 16,939
10 Gbps ports: 558
Racks: 828
Boxes: 11,728
Processors: 15,694
Cores: 64,238
HEPSpec06: 482,507
Disks: 64,109
Raw disk capacity (TiB): 63,289
Memory modules: 56,014
Memory capacity (TiB): 158
RAID controllers: 3,749
Tape drives: 160
Tape cartridges: 45,000
Tape slots: 56,000
Tape capacity (TiB): 34,000
IT power consumption: 2,456 kW
Total power consumption: 3,890 kW
[Chart: processors by type – Intel Xeon L5520 45%, L5640 13%, E5410 11%, E5345 9%, L5420 8%, AMD Opteron 6164 HE 6%, Xeon 5160/E5335/E5405 2% each, X5650 1%, others (5150, L5630, X5680) <1%]
Functionality drill-down
• “Clusters” – sets of machines with an identical configuration, different from that of other clusters
Problem Statement (Phase I)
• Small clusters
  – Far too many clusters, and too many managers
  – The small size of some clusters makes disruptive upgrades very difficult
    • OS/software upgrades
    • HW life-cycle management
  – Many servers poorly used
• Large clusters
  – Effective, efficient management is a must
• Virtualisation addresses part of these problems
Phase I: CERN Virtualisation Infrastructure (1)
• Addressing the “small cluster” problem
• Custom virtual machines in the CERN computer centre
  – VMs have a long-term lifetime of months/years
• User kiosk for requesting a VM in less than 30 minutes
• Based on Microsoft’s System Center Virtual Machine Manager (SCVMM) on top of Hyper-V
  – Enterprise-class centralised management
  – Rich feature set:
    • Grouping of hypervisors, with delegation of administrative privileges
    • VM migration, high availability
    • Checkpoints
    • PowerShell snap-in for administration/scripting
• Hardware implementation using ‘cells’ of blade servers and redundant iSCSI arrays
Phase I: CERN Virtualisation Infrastructure (2)
• Why SCVMM/Hyper-V?
  – The only cost-effective solution for CERN at the time offering the required advanced management functionality
• Current status
  – Checkpointing implemented
  – Hypervisors upgraded to Windows 2008 R2 SP1
  – Dynamic memory allocation, allowing for overcommitting memory
[Chart: CVI growth, March 2010 to February 2012 – Nov 2010: 680 VMs on 170 hypervisors; Feb 2012: 2,450 VMs on 350 hypervisors; 42% Windows VMs, 58% Linux VMs]
Phase I: Service Consolidation
• More than 600 Linux machines in CVI run as fully managed machines for physics services
  – Installation, configuration, monitoring
• We offer managed CERN Linux VMs with (some) combinations of:
  – 1-4 CPUs
  – 1-8 GB memory
  – 100-2000 GB disk
  – 1 Gbps paravirtualised network
    • 100 Mbps during installation
• CPU, disk and network are happily overcommitted
  – Typical physical CPU usage on hypervisors < 30%
  – Typical physical network usage < 2%
  – Real disk usage vs. committed capacity < 20%
• Memory is not overcommitted
• Current statistics:
  – 627 VMs
  – 1,245 virtual CPUs
  – 3,235 GB memory
  – 59 TB disk used (out of 265 TB allocated)
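The offering above boils down to a range check per resource. A minimal sketch in Python (hypothetical helper names and structure, not the actual CERN kiosk code):

```python
# Hypothetical sketch: validate a managed-VM request against the
# CVI service-consolidation offering (1-4 CPUs, 1-8 GB memory,
# 100-2000 GB disk). Illustrative only.

OFFERING = {
    "cpus": (1, 4),          # virtual CPUs
    "memory_gb": (1, 8),     # memory is not overcommitted
    "disk_gb": (100, 2000),  # disk is overcommitted on the backend
}

def validate_request(req):
    """Return a list of violations; an empty list means the request fits."""
    errors = []
    for key, (lo, hi) in OFFERING.items():
        value = req.get(key)
        if value is None or not lo <= value <= hi:
            errors.append("%s=%r outside %d-%d" % (key, value, lo, hi))
    return errors

print(validate_request({"cpus": 2, "memory_gb": 4, "disk_gb": 500}))  # fits
print(validate_request({"cpus": 8, "memory_gb": 4, "disk_gb": 500}))  # too many CPUs
```

A real kiosk would add image choice, ownership and network details on top of such a check.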
Phase I: lxcloud (1)
• Addressing the “large cluster” problem
• Aims:
  – Dynamic provisioning of resources to users of our large batch computing service
    • SLC 5 vs. SLC 6
    • User-group-specific customisations of the environment
  – Test provisioning of a generic cloud interface (EC2) to selected users
• Hardware: O(60) physical batch worker nodes (out of 4,000) with local storage
• Fully managed SLC 6 machines with KVM and KSM
Phase I: lxcloud (2)
• Image repository, image distribution mechanism
  – Images for virtual batch servers derived from fully managed “golden nodes”
  – User-supplied images for the EC2 interface
  – Internal distribution to hypervisors via a torrent-like mechanism
  – Sharing images across (WLCG) sites discussed in the context of HEPiX
• VM provisioning system: OpenNebula 3.2
  – Looking at OpenStack (see later)
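Torrent-like distribution rests on splitting an image into fixed-size pieces and verifying each piece by hash, so hypervisors can fetch pieces from one another instead of from a central server. A minimal sketch of the piece-hash part (the piece size and function names are assumptions for illustration; the slides do not detail the actual mechanism):

```python
import hashlib

PIECE_SIZE = 4 * 1024 * 1024  # 4 MiB pieces; arbitrary choice for this sketch

def piece_hashes(data, piece_size=PIECE_SIZE):
    """SHA-1 digest per fixed-size piece, as in a torrent's piece table."""
    return [hashlib.sha1(data[i:i + piece_size]).hexdigest()
            for i in range(0, len(data), piece_size)]

def verify(data, expected, piece_size=PIECE_SIZE):
    """Check a downloaded image against the published piece table."""
    return piece_hashes(data, piece_size) == expected

image = b"golden-node image contents" * 1000
table = piece_hashes(image)
assert verify(image, table)                   # intact copy passes
assert not verify(image + b"junk", table)     # corruption is detected
```

Piece-level hashing is what lets a node accept pieces from any untrusted peer and still end up with a bit-exact copy of the golden image.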
Phase II: New challenges
• The CERN data centre is reaching its limits
• IT staff numbers remain fixed, but more computing capacity is needed
• Tools are high-maintenance and becoming increasingly brittle
• Inefficiencies exist, but their root causes cannot easily be identified
CERN Data Centre
• More capacity needed for the processing of LHC data
• For various reasons, not possible to provide additional capacity at CERN
• 2010: calls for expressions of interest among CERN member states; 2011: call for tender; 2012: adjudication to the Wigner Institute in Budapest, Hungary
• Timescales: prototyping in 2012, testing in 2013, production in 2014
• This will be a “hands-off” facility for CERN
  – Only “smart hands” there; everything else done remotely
• Disaster recovery for key services in the primary CERN data centre becomes a realistic scenario
Usage Model(s)
• Various possible models of usage – sure to evolve
  – The less specific the hardware installed, the easier it is to change its function
• Vision: run both massively scaled services (“kettle”) and carefully set-up special services (“pets”) as virtual machines on top of “kettle”-style hypervisors
Infrastructure Tools Evolution (1)
• We had to develop our own toolset in 2002
  – Installation, configuration, monitoring
• Nowadays:
  – CERN compute capacity is no longer leading-edge
  – Many options are available for open-source fabric management
  – We need to scale to meet the upcoming capacity increase
• If a requirement is not covered by an open-source tool, we should question the need
  – If we are the first to need it, contribute it back to the open-source tool
• A large community out there takes the “tool chain” approach, and its scaling needs match ours: O(100k) servers and many applications
  – Many small tools for specific purposes, linked together
    • Easy to exchange one tool for an alternative
Infrastructure Tools Evolution (2)
Infrastructure Tools Evolution (3)
• Configuration management:
  – Using off-the-shelf components
    • Puppet – configuration definition
    • Foreman – GUI and data store
    • Git – version control
    • Mcollective – remote execution
  – Integrated with:
    • CERN Single Sign-On
    • CERN Certificate Authority
    • Installation server
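The core idea behind a Puppet-style component is declarative convergence: describe the desired state, compare it with the actual state, and act only on the difference, so repeated runs are idempotent. A toy illustration of one "file resource" in Python (this is not Puppet code, just the underlying idea):

```python
import os
import tempfile

def ensure_file(path, content):
    """Converge a file resource: rewrite only if it differs from the desired state."""
    try:
        with open(path) as f:
            if f.read() == content:
                return "unchanged"   # already converged: do nothing
    except FileNotFoundError:
        pass                         # missing file: fall through and create it
    with open(path, "w") as f:
        f.write(content)
    return "changed"

# Repeated runs converge: the first run changes, the second is a no-op.
path = os.path.join(tempfile.mkdtemp(), "motd")
print(ensure_file(path, "managed by the config tool\n"))  # changed
print(ensure_file(path, "managed by the config tool\n"))  # unchanged
```

Idempotence is what makes it safe to run the configuration agent on every node on a schedule.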
Infrastructure as a Service
• Goals
  – Improve repair processes with virtualisation
  – More efficient use of our hardware
  – Better tracking of usage
  – Enable remote management for the new data centre
  – Support potential new use cases (PaaS, cloud)
  – Sustainable support model
• At scale for 2015:
  – 15,000 servers
  – 90% of hardware virtualised
  – 300,000 VMs needed
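A back-of-the-envelope check on those targets, assuming the VM count is spread evenly over the virtualised fraction of the servers:

```python
# Illustrative arithmetic on the 2015 scale targets from the slide.
servers = 15_000
virtualised_fraction = 0.90
vms = 300_000

hypervisors = int(servers * virtualised_fraction)  # 13,500 hypervisors
vms_per_hypervisor = vms / hypervisors
print(hypervisors, round(vms_per_hypervisor, 1))   # 13500 22.2
```

That is roughly 22 VMs per hypervisor on average, an order of magnitude beyond the CVI consolidation ratio of about 7 VMs per hypervisor reported for February 2012.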
OpenStack
• Open-source cloud software
• Supported by 173 companies, including IBM, Red Hat, Rackspace, HP, Cisco, AT&T, …
• Vibrant development community and ecosystem
• Infrastructure as a Service at our scale
• Started in 2010, but maturing rapidly
OpenStack at CERN (1)
[Diagram: OpenStack components – NOVA (compute, scheduler, network, volume); GLANCE (image, registry); KEYSTONE; HORIZON]
OpenStack at CERN (2)
• Multiple uses of IaaS
  – Server consolidation
  – Classic batch (single- or multi-core)
  – Cloud VMs such as CERNVM
• Scheduling options
  – Availability zones for disaster recovery
  – Quality-of-service options to improve efficiency, e.g. for build machines and public login services
  – Batch system scalability is likely to be an issue
• Accounting
  – Use the underlying services of IaaS and the hypervisors for reporting and quotas
Monitoring
• Action needed
  – More than 30 monitoring applications
    • Number of producers: ~40k
    • Input data volume: ~280 GB per day
  – Covering a wide range of different resources
    • Hardware, OS, applications, files, jobs, etc.
  – Application-specific monitoring solutions
    • Using different technologies (including commercial tools)
    • Sharing similar needs: aggregate metrics, get alarms, etc.
  – Limited sharing of monitoring data
    • Hard to implement complex monitoring queries
Monitoring: New Architecture
[Diagram: producers/sensors publish to a messaging broker (Apollo); storage and operations consumers feed a storage & analysis engine (Hadoop, Splunk) and operations tools (Lemon); dashboards and APIs on top]
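The fan-in pattern of the new architecture, many producers publishing metric messages to a broker with independent consumers for storage and operations, can be sketched with an in-process queue standing in for the broker (illustrative only; the real system uses Apollo messaging, and all names here are invented):

```python
import queue

broker = queue.Queue()  # stands in for the Apollo messaging broker

def producer(host, metric, value):
    """A sensor publishes one metric message to the broker."""
    broker.put({"host": host, "metric": metric, "value": value})

def drain(consumers):
    """Deliver every queued message to all registered consumers."""
    while not broker.empty():
        msg = broker.get()
        for consume in consumers:
            consume(msg)

storage = []   # storage consumer archives everything for later analysis
alarms = []    # operations consumer raises alarms above a threshold
consumers = [storage.append,
             lambda m: alarms.append(m) if m["value"] > 90 else None]

producer("lxplus001", "cpu_percent", 97)
producer("lxplus002", "cpu_percent", 12)
drain(consumers)
print(len(storage), len(alarms))  # 2 1
```

Decoupling producers from consumers through the broker is what lets ~40k producers feed both archival (Hadoop, Splunk) and operations paths without knowing about either.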
Current tool snapshot (subject to change!)
[Diagram: the tool chain at a glance – Jenkins; Koji, Mock; Puppet, Foreman; AIMS/PXE, Foreman; Yum repo, Pulp; Puppet stored-config DB; mcollective, yum; JIRA; Lemon; git, SVN; OpenStack Nova; hardware database]
Timelines
• 2012
  – Prepare formal project plan
  – Establish IaaS in the CERN data centre
  – Monitoring implementation as per WG
  – Migrate lxcloud users
  – Early adopters to use new tools
• 2013 – LS 1 (LHC Long Shutdown 1); new data centre
  – Extend IaaS to the remote data centre
  – Business continuity
  – Migrate CVI users
  – General migration to new tools with SLC6 and Windows 8
• 2014 – LS 1 (to November)
  – Phase out legacy tools such as Quattor
Conclusions
• A remote Tier 0 and other challenges require us to re-think the way we run the computer centre and our services
• Virtualisation has proved to be the right way forward (CVI/service consolidation and lxcloud)
• Now unifying on a single tool (OpenStack) and going much further
  – Coverage of machines and services
  – Tool chain for installation, configuration, monitoring, IaaS
  – Proof of concept done rapidly; very successful
• People are highly motivated
More information
• HEPiX Agile Infrastructure talks: http://cern.ch/go/99Ck
• Tier-0 upgrade: http://cern.ch/go/NN98
• Other info or contacts: Helge.Meinhard (at) cern.ch
Acknowledgements
• Numerous colleagues and collaborators at CERN, including
  – Ian Bird
  – Tim Bell
  – Gavin McCance
  – Ulrich Schwickerath
  – Alexandre Lossent
  – Jose Castro Leon
  – Jan van Eldik
  – Belmiro Moreira
Thank you
Helge Meinhard