The CERN Agile Infrastructure Project: Configuration and Operations Tools
Helge Meinhard / CERN-IT (replacing Manuel Guijarro), HEPiX Spring 2012, 24 April 2012, Praha. https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
The CERN Agile Infrastructure Project:
Configuration and Operations Tools
Helge Meinhard / CERN-IT (replacing Manuel Guijarro)
HEPiX Spring 2012
24 April 2012, Praha
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
https://agileinf.cern.ch/jira/
Agile Infrastructure - Configuration and Operation Tools
Project Scope
The project is reviewing the entire CERN computer-centre management toolset
– What happens from the bare metal up
– Asset management, inventory
– Sysadmin tools and maintenance workflows
– Service management and configuration tools
– Dynamic configuration for ‘virtual’ hosts
– Operations monitoring
– Workflow automation and continuous deployment
– …
Configuration and Operations Tools
Why?
Current production system built around the Quattor toolset is successfully managing O(10k) servers
– (CERN) Quattor + many CERN components
Why are we changing the toolset?
What are the Issues (1)
Incompressible technical debt
– The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources
– Small community (less funding) and general support problem. At CERN, we’ve fallen into the “sticky hands” support model
We need better automation and integration between the sub-components
– Lack of automated workflow: everything is a ticket
  emailScript™: your added value in the process is often your CERN password
– The 15-min “CDB commit walk”: context switch cost
What are the Issues (2)
Transferable skills and training
– Learning curve for our tools is steep and remains high
– It’s easier to hire people who have skills in a widely-used tool than in your internal tools
  Depending on where you look
Job Adverts – indeed.com
Index of millions of worldwide job posts across thousands of job sites
These are the sort of posts our departing staff will be applying for.
[Chart: job posts mentioning Puppet vs. Quattor]
Integration is Hard
IPv6, virtualisation, Windows Server all need a solution
– We could leverage lots of open source tools
  But piecemeal integration of these requires high investment due to our complex system
  Years of organic growth have made the system way too ‘hairy’
  It’s often easier to reinvent rather than integrate
– Lack of ‘dynamic-ness’ in the infrastructure
  We hack the config system for dynamic VMs
It’s critical to look at the system as a whole
Use Puppet for the Core
The tool space has exploded in the last few years
– In configuration management and ops
– Large, shared ‘tool forges’, and lots of experience
Puppet and Chef are the clear leaders for the ‘core’ tool
– Other tools in our ‘scope’ try to integrate with those
Many large-scale enterprises use Puppet
– Its declarative approach fits better with what we are used to
– Large installations: friendly, wide-base community and commercial support and training
– You can buy books on it
Scaling Challenges: Nodes
Currently we have O(10k) physical nodes
IaaS approach:
– Moving to virtual machines
– More (smaller, load-balanced) service nodes
– VMs for raw compute (batch or pilot jobs)
– Homogeneous: compute + storage on the same node
Add another computer centre and 24/48 SMT cores per node, and you get 100k – 300k virtual nodes to be managed
– A 99.6% (1) node update success rate means 1200 manual interventions to “fix it” at 300k nodes
(1) in a recent intervention on lxbatch
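The arithmetic behind the 1200 figure can be checked with a short sketch. The function name is illustrative only; the node counts and the 99.6% success rate are the ones quoted on the slide.

```python
def expected_manual_fixes(nodes: int, success_rate: float) -> int:
    """Expected number of nodes whose update fails and needs manual repair."""
    return round(nodes * (1.0 - success_rate))


# Node counts from the slide: today's O(10k), then the 100k - 300k VM range.
for nodes in (10_000, 100_000, 300_000):
    print(nodes, expected_manual_fixes(nodes, 0.996))
```

At today's scale the same success rate costs about 40 interventions per update, which is why a rate that looks acceptable now stops being acceptable at 300k nodes.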
Scaling Challenges: People
Many, diverse applications (“clusters”) managed by different teams
…and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system
Agile Infrastructure 1st Try (1)
First started investigating tools in September 2011 using ‘part-time’ resources from several IT groups
– Trying iterative “agile-sprint” style (Scrum): short sprints, feedback, sprint review, visible
– Take first, best-guess at architecture and tool selection, iterate
Mixed success with this agile style
– What works: Good visibility and reviews. Daily “scrum” meeting useful. Weekly review meeting open to management.
– What doesn’t: The “time boxing” part of Scrum sprints is hard with part-time resources
– Now more staff available, but still mostly part-time efforts
Agile Infrastructure 1st Try (2)
We’re currently running:
– OpenStack as cloud software for virtual machines, image management, bulk storage
  See later presentation
– Puppet for the configuration management core
– …with Foreman as a dashboard
Foreman Dashboard
Agile Infrastructure 1st Try (2)
None of the tools are “perfect” out-of-the-box
– …but we’d rather submit patches to a good open source tool than re-implement it
– We’ve experienced very good community support: RFCs and patches are quickly accepted
– Very active community: often problems are fixed and missing features implemented before you even report them
Agile Infrastructure 1st Try (3)
We’re currently running:
– yum for software distribution (replacing spma)
– git for template management: why git?
  Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates
  Many of the tools we can benefit from also assume git
  We should not be different from the rest of the community
Puppet
Client/server architecture
– “puppetmaster”: horizontally scalable Rails application
– X509 cert authenticated nodes: integrate with CERN CA
Puppet
Puppet runs on the client, applying the configuration changes
– It detects the current state and only runs if there’s something to do
It runs every few minutes
– New configuration will be ~immediately applied (“fail-fast”).
– This is a change from CDB where ‘latent’ changes can be stacked up
Normal mode is client-side compile (“assume success”)
– No more CDB commit waits
– Change from CDB: the compilation fails later
Good monitoring is a pre-req: puppet sends reports back to the puppetmaster
– The Foreman tool can collect these for you
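The “detect the current state, act only if there is something to do, then report” behaviour described on this slide can be illustrated with a minimal Python sketch. This is not how Puppet is implemented; `Report`, `converge` and the service names are hypothetical and only model the idea of an idempotent run.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """What one run changed on one host (hypothetical report shape)."""
    host: str
    changed: list = field(default_factory=list)


def converge(host, current, desired):
    """Compare current state with desired state and fix only the differences."""
    report = Report(host)
    for key, want in desired.items():
        if current.get(key) != want:
            current[key] = want          # apply the change
            report.changed.append(key)   # record it for the report
    return report


state = {"ntp": "stopped"}
wanted = {"ntp": "running", "sshd": "running"}
first = converge("node001", state, wanted)
second = converge("node001", state, wanted)
print(first.changed)   # keys fixed on the first run
print(second.changed)  # empty: nothing left to do
```

Because the second run finds nothing to change, running the agent every few minutes is cheap, and a non-empty report is itself a useful monitoring signal.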
Puppet Language
Puppet uses its own Ruby-like language for the templates to “assert” the desired state of the nodes
– With Ruby fall-back for hard stuff (we’ve only needed this once)
Being declarative rather than procedural, there are quirks
– Takes a bit of practice to ‘get it’
– There are books, online docs, online cook-books, and a large community to help
It dispenses with the need for ncm components
– All the work is done by puppet on the node itself; you just provide the template part to assert what you want done
– Less software -> easier to move to new OS versions
Externals
Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates
– Node function + hardware
– Moving a host between clusters is a DB update
Your configuration can use variables the node detects itself
– e.g. reconfigure daemons based on where a newly live-migrated VM has found itself
Query the compiled configuration of other hosts
– e.g. Open my firewall to the lxadm nodes
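A rough sketch of why “moving a host between clusters is a DB update” follows, using an in-memory SQLite table as a stand-in for the external database. The schema, hostnames and function names are assumptions for illustration; the `classes` / `parameters` keys loosely mirror what a Puppet external node classifier returns, but this is not the CERN setup.

```python
import sqlite3


def make_db():
    """Create a toy node database with one host in the 'batch' cluster."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE hosts (hostname TEXT PRIMARY KEY, cluster TEXT NOT NULL)"
    )
    db.execute("INSERT INTO hosts VALUES ('node001.example.org', 'batch')")
    db.commit()
    return db


def classify(db, hostname):
    """Look up a host's function; the result drives which recipes it gets."""
    row = db.execute(
        "SELECT cluster FROM hosts WHERE hostname = ?", (hostname,)
    ).fetchone()
    if row is None:
        raise KeyError(hostname)
    return {"classes": [row[0]], "parameters": {"cluster": row[0]}}


def move_host(db, hostname, new_cluster):
    """Re-assign a host: a single UPDATE, no template edit or commit."""
    db.execute(
        "UPDATE hosts SET cluster = ? WHERE hostname = ?", (new_cluster, hostname)
    )
    db.commit()


db = make_db()
print(classify(db, "node001.example.org"))
move_host(db, "node001.example.org", "webserver")
print(classify(db, "node001.example.org"))
```

The next configuration run on the host would then pick up the new classification without anyone touching a textual template.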
Moving towards PaaS
Parametrisable recipes
– Just fill in the blanks
The aim is to make it easy to use “pre-canned” recipes without even touching a Puppet template
– e.g. stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box
– …with these parameters
Moving us in the PaaS direction
– Ultimately, it would be better if you never even needed to log into this node (J2EE public service, IT web hosting service, MySQL service)
Standard Workflow
[Diagram: three edit-test-deploy loops compared]
CDB on lxadm
– check out from CDB
– update templates
– CDB commit
– run and check on test node
– notify with nc-client
– iterate (timing label: n minutes)
Puppet on lxadm
– check out from git
– update templates
– git commit and push
– run and check on test node
– notify with mcollective
– iterate (timing label: 1 minute)
Puppet-apply on test node
– check out from git on the test node
– update templates
– run puppet-apply
– check on test node
– notify with mcollective
– iterate
Remaining steps shown: check on foreman, check on node(s), check on foreman, git commit and push
Modernising our Processes (1)
Our software processes for the computer centre are fairly limited
– fire-and-forget broadcasts to project-elfms
…and rather manual
– The manual test/ -> preprod/ -> prod/ template dance
– Our toolset RPMs are ‘built on laptop’ and uploaded to ‘swrep’ by hand
Add standard continuous integration (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC
– …then automate the testing
– e.g. suitably tagged RPMs are automatically deployed to /test nodes.
Modernising our Processes (2)
We’re working out which of the many puppet / git models suits us
– Code review, sign-off and automated notification for changes that will affect multiple clusters
– How to automate the test/preprod/prod advancement
Pre-req is flexible monitoring and alarming
– You need to trust that an automation failure will be signaled to you
Script-generated emails are banned
– Need good monitoring to hang these notifications on
Integrate components rather than use emailScript™
– Script-generated tickets (where your value in the process is your password) are banned
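One possible shape for automated test/preprod/prod advancement is a gate that promotes a change only when monitoring reports no failed nodes at the current stage. This is a sketch under stated assumptions, not a chosen design: the stage names come from the slides, while `advance` and the failed-node list are hypothetical.

```python
STAGES = ["test", "preprod", "prod"]


def advance(current, failed_nodes):
    """Return the stage a change should be in after its monitoring check.

    The change stays where it is if any node reported a failure (that case
    should raise an alarm instead) or if it has already reached the last stage.
    """
    if failed_nodes:
        return current
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]


print(advance("test", []))            # clean run: promoted one stage
print(advance("test", ["node042"]))   # failure: held back
print(advance("prod", []))            # already at the final stage
```

The gate is only as trustworthy as the failure list fed into it, which is why flexible monitoring is listed above as the prerequisite.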
Current Tool Snapshot (Liable to Change)
Jenkins
Koji, Mock
Puppet, Foreman
AIMS/PXE, Foreman
Yum repo, Pulp
Puppet stored config DB
mcollective, yum
JIRA
Lemon
git, SVN
Openstack Nova
Hardware database
Preliminary Timelines
Year | What | Actions
2011 | | Agree overall principles
2012 | | Prepare formal project plan; Establish IaaS in CERN CC; Production Agile Infrastructure; Monitoring Implementation as per WG; Migrate lxcloud; Early adopters to Agile Infrastructure
2013 | LSD 1; New Data Centre | Extend IaaS to remote CC; Business Continuity; Support Experiment App re-work; Migrate CVI; General migration to Agile with SLC6 and Windows 8
2014 | LSD 1 (to November) | Phase out Quattor/CDB/…
Aggressive schedule if we are to make it for new data centre
Initial Steps
Decided on tools
Integrating them to make a production setup
– We can still change… but we’re starting to commit…
Looking for early adopters
– In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?
  e.g. PES/OIS services: batch/VMs, JIRA, Drupal
  https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012
– Help with integration / coding
– Help with ideas
– Help with building the task list
Summary
IT has started a new project to move our infrastructure to a new toolset based around industry standard open source components
– Puppet for the core configuration tool
– Better integration between components
– Use of more modern software processes to aid deployment
– Better monitoring
– Engage with the community rather than re-implement
Overall project scope is wider (see following presentations)
– Improved monitoring
– Cloud and virtualisation
Actively seeking wide involvement from CERN-IT and feedback from the community
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
Acknowledgements
Many colleagues at CERN-IT, including
– Tim Bell
– Ian Bird
– Bernd Panzer-Steindel
– Gavin McCance
– Manuel Guijarro
Agile Infrastructure Making IT operations better since 2013
[Logos: Jenkins, OpenStack, Koji, ActiveMQ, Foreman, Puppet, mcollective, git]