glidein factory operations
DESCRIPTION
What it takes to operate a glideinWMS glidein factory. The OSG experience.
TRANSCRIPT
glideinWMS training G.Factory Operations 1
glideinWMS training
Glidein Factory Operations
i.e. how do we spend our time?
by Igor Sfiligoi (UCSD)
G. Factory Operation Categories
● Factory node operations
● Serving VO Frontend Admin requests
● Keeping up with changes in the Grid
● Debugging Grid problems
G. Factory Operation Ongoing Costs
● Factory node operations
  ● Pretty much runs itself, unexpected <1 day/month
● Serving VO Frontend Admin requests
  ● Highly variable, average a few hours/week
● Keeping up with changes in the Grid
  ● Variable, currently O(10 hours)/week
● Debugging Grid problems
  ● More than we have effort for!
Better tools could drastically reduce this
Factory node ops
● The factory mostly just runs
  ● Occasional upgrade of SW needed, but typically fast and painless
● Most effort goes into investigating unexpected behavior, e.g.
  ● High load
  ● Weird problems after a reboot/OS upgrade
● Of course, installing a new node can take significant time
  ● But that is a very rare event
O(hours)/month
VO FE Admin requests
● Adding a new VO FE can be expensive
  ● Apart from config changes, we need to help them start running
  ● However, it is relatively rare to have new VOs
● In steady state, VOs may request
  ● New sites
  ● New attributes
● G.Factory operators must also assist with debugging FE config changes
  ● Error logs come only to the GF (currently)
O(hours)/week
Following changes in the Grid
● G.Factory operational principle is trust-but-verify
  ● G.Factory admins must approve any change in the G.Factory config
● The Grid is a very dynamic place
  ● At least one site makes a change every single day
  ● Mostly complaint driven, we have no good tools to automate change discovery
● G.Factory admins thus must change the G.Factory config often
  ● Currently mostly a manual process
Better tools would be welcome
O(10 hours)/week
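As a rough illustration of what automated change discovery could look like under the trust-but-verify principle, here is a minimal Python sketch. It is not part of glideinWMS: the snapshot format, the attribute names (`gatekeeper`, `max_walltime`), and the functions `discover_changes` and `apply_approved` are all hypothetical. It compares two snapshots of site attributes and applies only those differences an admin has explicitly approved.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical snapshot format: site name -> {attribute name -> value}.
Snapshot = Dict[str, Dict[str, str]]
Change = Tuple[str, str, Optional[str], Optional[str]]


def discover_changes(old: Snapshot, new: Snapshot) -> List[Change]:
    """Return (site, attribute, old_value, new_value) for every difference.

    None stands for "attribute (or site) not present in that snapshot".
    """
    changes: List[Change] = []
    for site in sorted(set(old) | set(new)):
        old_attrs = old.get(site, {})
        new_attrs = new.get(site, {})
        for attr in sorted(set(old_attrs) | set(new_attrs)):
            before = old_attrs.get(attr)
            after = new_attrs.get(attr)
            if before != after:
                changes.append((site, attr, before, after))
    return changes


def apply_approved(config: Snapshot, changes: List[Change],
                   approved: Set[Tuple[str, str]]) -> Snapshot:
    """Return a copy of config with only the admin-approved changes applied."""
    updated = {site: dict(attrs) for site, attrs in config.items()}
    for site, attr, _before, after in changes:
        if (site, attr) not in approved:
            continue  # trust-but-verify: unapproved changes are never applied
        if after is None:
            updated.get(site, {}).pop(attr, None)
        else:
            updated.setdefault(site, {})[attr] = after
    return updated


# Usage with made-up data: one changed attribute and one new site.
old = {"SiteA": {"gatekeeper": "ce1.example.org", "max_walltime": "24h"}}
new = {
    "SiteA": {"gatekeeper": "ce2.example.org", "max_walltime": "24h"},
    "SiteB": {"gatekeeper": "ce.siteb.example.org"},
}
changes = discover_changes(old, new)
# The admin approves only the SiteA gatekeeper change; SiteB stays pending.
updated = apply_approved(old, changes, {("SiteA", "gatekeeper")})
```

Keeping discovery and approval as separate steps mirrors the slide: a tool may propose changes, but nothing reaches the config without an admin decision.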
Grid debugging 1/2
● With O(50k) glideins running at any time, we always find something broken somewhere
● Full spectrum of errors
  ● Broken worker nodes (validation errors)
  ● Broken CEs (authentication/startup/monitor errors)
  ● Network problems (glideins not registering)
● Mostly we cannot directly solve the problem(s)
  ● i.e. we have to notify remote admins
  ● But we have to discover the root cause to get it solved
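The error spectrum above suggests a simple first triage step: group observed failures by site and by the component each error type points to. The Python sketch below is only an illustration under assumed inputs. The category labels, the `SUSPECT` mapping, and the functions `summarize` and `worst_sites` are hypothetical, and the records are made up rather than taken from real glidein logs.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from an error category to the component that the
# error spectrum above associates with it.
SUSPECT = {
    "validation": "worker node",
    "authentication": "CE",
    "startup": "CE",
    "monitor": "CE",
    "not_registered": "network",
}


def summarize(records: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count suspected components per site from (site, category) records."""
    per_site: Dict[str, Counter] = defaultdict(Counter)
    for site, category in records:
        per_site[site][SUSPECT.get(category, "unknown")] += 1
    return per_site


def worst_sites(per_site: Dict[str, Counter], threshold: int) -> List[str]:
    """Return, sorted by name, the sites with at least `threshold` errors."""
    return sorted(site for site, counts in per_site.items()
                  if sum(counts.values()) >= threshold)


# Usage with made-up records.
records = [
    ("SiteA", "validation"),
    ("SiteA", "validation"),
    ("SiteA", "startup"),
    ("SiteB", "not_registered"),
    ("SiteB", "weird"),
]
summary = summarize(records)
to_notify = worst_sites(summary, threshold=3)
```

A summary like this does not find the root cause, but it narrows down which remote admins to contact and which component to ask them about.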
Grid debugging 2/2
● The Grid is a difficult place to debug
  ● Most sites are black boxes for us
● Luckily, glideins provide lots of info in the logs
  ● When we get them... a broken site may not return anything useful, or anything at all
● Prodding the black box is often needed
  ● Which is hard!
  ● And some problems may be VO specific, too
Many FTEs, if we had them
What else do we do?
● In order to make our life easier, we also
  ● Host a test glideinWMS instance
  ● Develop new helper tools
● The test glideinWMS instance allows us to discover problems early, thus both
  ● Increasing user satisfaction
  ● Reducing the time needed for debugging errors
● We create helper tools to suit our needs
  ● And anything major we contribute back to glideinWMS
The test glideinWMS Instance
● The test glideinWMS instance contains both a G.Factory and a VO Frontend
  ● This allows us to do end-to-end testing
● Major focus on the G.Factory, to test before deploying in production
  ● New SW releases
  ● New sites
  ● New services on existing sites
Summary
● Operating a G.Factory is much more than keeping the G.Factory service alive
  ● Indeed, this part takes an almost negligible amount of time
● Most effort goes into debugging Grid-related problems
  ● At O(50k) CPUs, something is always broken somewhere
● Finally, providing expertise to help VO FE Admins is also an essential part of the job
Acknowledgments
● This document was sponsored by grants from the US NSF and US DOE, and by the UC system