glidein factory operations

glideinWMS training: Glidein Factory Operations, i.e. How we spend our time? by Igor Sfiligoi (UCSD)

Upload: igor-sfiligoi

Post on 15-Jan-2015

DESCRIPTION

What it takes to operate a glideinWMS glidein factory. The OSG experience.

TRANSCRIPT

Page 1: Glidein Factory Operations

glideinWMS training G.Factory Operations 1

glideinWMS training

Glidein Factory Operations, i.e. How we spend our time?

by Igor Sfiligoi (UCSD)

Page 2: Glidein Factory Operations

G. Factory Operation Categories

● Factory node operations

● Serving VO Frontend Admin requests

● Keeping up with changes in the Grid

● Debugging Grid problems

Page 3: Glidein Factory Operations

G. Factory Operation Ongoing Costs

● Factory node operations
  ● Pretty much runs itself, unexpected issues take <1 day/month

● Serving VO Frontend Admin requests
  ● Highly variable, averaging a few hours/week

● Keeping up with changes in the Grid
  ● Variable, currently O(10 hours)/week

● Debugging Grid problems
  ● More than we have effort for!

Better tools could drastically reduce this

Page 4: Glidein Factory Operations

Factory node ops

● The factory mostly just runs
  ● Occasional upgrade of SW needed, but typically fast and painless

● Most effort going into investigating unexpected behavior, e.g.
  ● High load
  ● Weird problems after a reboot/OS upgrade

● Of course, installing a new node can take significant time
  ● But a very rare event

O(hours)/month

Page 5: Glidein Factory Operations

VO FE Admin requests

● Adding a new VO FE can be expensive
  ● Apart from config changes, we have to help them start running
  ● However, relatively rare to have new VOs

● In steady state, VOs may request
  ● New sites
  ● New attributes

● G.Factory operators also must assist with debugging FE config changes
  ● Error logs come only to GF (currently)

O(hours)/week

Page 6: Glidein Factory Operations

Following changes in the Grid

● G.Factory operational principle is trust-but-verify
  ● G.Factory admins must approve any change in the G.Factory config

● Grid a very dynamic place
  ● At least one site makes a change every single day
  ● Mostly complaint driven, have no good tools to automate change discovery

● G.Factory admins thus must change the G.Factory config often
  ● Currently mostly a manual process

Better tools would be welcome

O(10 hours)/week
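To make the trust-but-verify idea concrete, here is a minimal sketch of what an automated change-discovery helper might look like. It is not part of glideinWMS: the dictionary layout, the site names, the attribute names and the functions `discover_changes` and `apply_approved` are all hypothetical, chosen only for illustration. The helper proposes differences between the approved config and what a site currently advertises, and nothing is applied unless an admin explicitly approves it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProposedChange:
    """One difference between the approved config and the observed site state."""
    site: str
    attribute: str
    approved_value: Optional[str]
    observed_value: Optional[str]


def discover_changes(approved, observed):
    """Compare two {site: {attribute: value}} dicts and list every difference.

    Nothing is modified here; the result is only a list of proposals.
    """
    changes = []
    for site in sorted(set(approved) | set(observed)):
        old = approved.get(site, {})
        new = observed.get(site, {})
        for attr in sorted(set(old) | set(new)):
            if old.get(attr) != new.get(attr):
                changes.append(
                    ProposedChange(site, attr, old.get(attr), new.get(attr))
                )
    return changes


def apply_approved(approved, changes, approved_by_admin):
    """Return a new config containing only the changes an admin approved."""
    updated = {site: dict(attrs) for site, attrs in approved.items()}
    for change in changes:
        if change not in approved_by_admin:
            continue  # trust-but-verify: unapproved proposals are ignored
        site_cfg = updated.setdefault(change.site, {})
        if change.observed_value is None:
            site_cfg.pop(change.attribute, None)
        else:
            site_cfg[change.attribute] = change.observed_value
    return updated


# Hypothetical example data (not a real factory config format).
approved = {"SiteA": {"gatekeeper": "ce1.example.org", "max_walltime": "48h"}}
observed = {
    "SiteA": {"gatekeeper": "ce2.example.org", "max_walltime": "48h"},
    "SiteB": {"gatekeeper": "ce.siteb.example.org"},
}
changes = discover_changes(approved, observed)
new_config = apply_approved(approved, changes, approved_by_admin={changes[0]})
```

In this sketch the new site `SiteB` is discovered but stays out of the config, because only the first proposal was approved.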

Page 7: Glidein Factory Operations

Grid debugging 1/2

● With O(50k) glideins running at any time, we always find something broken somewhere

● Full spectrum of errors
  ● Broken worker nodes (validation errors)
  ● Broken CEs (authentication/startup/monitor errors)
  ● Network problems (glideins not registering)

● Mostly cannot directly solve the problem(s)
  ● i.e. have to notify remote Admins
  ● But we have to discover the root cause to get it solved
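The error spectrum above can be sketched as a small triage helper that groups failed glideins by site and error category, so that operators know which remote admins to notify and about what. This is only an illustration under stated assumptions: the `(site, stage)` record shape, the stage names and the functions `categorize`, `tally_by_site` and `sites_to_notify` are hypothetical and do not correspond to actual glideinWMS log formats or tools.

```python
from collections import Counter

# Hypothetical mapping from the stage at which a glidein failed
# to the error categories listed on the slide.
CATEGORY_BY_STAGE = {
    "validation": "broken worker node",
    "authentication": "broken CE",
    "startup": "broken CE",
    "monitor": "broken CE",
    "registration": "network problem",
}


def categorize(stage):
    """Map a failure stage to a category; unknown stages stay unclassified."""
    return CATEGORY_BY_STAGE.get(stage, "unclassified")


def tally_by_site(failures):
    """Count failures per site and category from (site, stage) tuples."""
    tally = {}
    for site, stage in failures:
        tally.setdefault(site, Counter())[categorize(stage)] += 1
    return tally


def sites_to_notify(tally, threshold):
    """Return {site: dominant category} for sites at or above the threshold."""
    notify = {}
    for site, counts in tally.items():
        category, count = counts.most_common(1)[0]
        if count >= threshold:
            notify[site] = category
    return notify


# Hypothetical failure records.
failures = [
    ("SiteA", "validation"),
    ("SiteA", "validation"),
    ("SiteA", "startup"),
    ("SiteB", "registration"),
    ("SiteC", "unknown"),
]
tally = tally_by_site(failures)
to_notify = sites_to_notify(tally, threshold=2)
```

A tally like this only narrows down where to look; finding the root cause still requires reading the glidein logs for the affected site.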

Page 8: Glidein Factory Operations

Grid debugging 2/2

● Grid a difficult place to debug
  ● Most sites are black boxes for us

● Luckily, glideins provide lots of info in the logs
  ● When we get them... a broken site may not return anything useful, or anything at all

● Prodding the black box often needed
  ● Which is hard!
  ● And some problems may be VO specific, too

Many FTEs DC, if we had them

Page 9: Glidein Factory Operations

What else do we do?

● In order to make our life easier, we also
  ● Host a test glideinWMS instance
  ● Develop new helper tools

● The test glideinWMS instance allows us to discover problems early, thus both
  ● Increasing user satisfaction
  ● Reducing the time needed in debugging errors

● We create helper tools to suit our needs
  ● And anything major we contribute back to glideinWMS

Page 10: Glidein Factory Operations

The test glideinWMS Instance

● The test glideinWMS instance contains both a G.Factory and a VO Frontend
  ● This allows us end-to-end testing

● Major focus on the G.Factory, to test before deploying in production
  ● New SW releases
  ● New sites
  ● New services on existing sites

Page 11: Glidein Factory Operations

Summary

● Operating a G.Factory is much more than keeping the G.Factory service alive
  ● Indeed, this part takes an almost negligible amount of time

● Most effort going into debugging Grid-related problems
  ● At O(50k) CPUs, something is always broken somewhere

● Finally, providing expertise to help VO FE Admins is also an essential part of the job

Page 12: Glidein Factory Operations

Acknowledgments

● This document was sponsored by grants from the US NSF and US DOE, and by the UC system