
Page 1: LCG deployment workshop summary

LCG deployment workshop summary

Alessandra Forti
HEP Sysman meeting
Manchester, 11 November 2004

Page 2: LCG deployment workshop summary

Outline

• Status
• Operation Summary (Ian Bird)
• Fabric Management Summary (Davide Rebatto)
• Software Management (Steve Traylen)
• Operation Security (Dave Kelsey)
• User Support (Flavia Donno)

Page 3: LCG deployment workshop summary

Status

• LCG now covers many sites (>80), both large and small, with different needs
  – This requires very flexible packaging, installation and configuration tools and procedures
• There are many problems – but in the end it seems to work
  – Middleware is relatively stable and reliable
  – The system is used in production
  – >80 sites managed the installation
  – We now have a basis on which to incrementally build essential functionality
• This infrastructure forms the basis of the initial EGEE production service

Page 4: LCG deployment workshop summary

Status

• Data challenges:
  – The LCG-2 system was used for the first time during the DCs
  – Significant efforts were invested on all sides – very fruitful collaborations
  – Adaptations were essential – adapting experiment software to middleware and vice versa
  – Many problems were recognised and addressed
• Level of complexity anticipated for LHC, O(100) sites: most of the operations issues are already appearing

Page 5: LCG deployment workshop summary

Operations so far

• The existing mode is not sustainable
  – Monitoring is run by (very few) people at CERN, RAL and Taipei
  – Tools are not yet sufficient
• Many problems reported
  – Most get bounced to the CERN team (or left to them), but
  – several other people are quite active in posting and addressing questions
• No control over “bad” sites
  – Remove them from the information system… but which one?
• User support
  – Not at all clear where a user should report problems

Page 6: LCG deployment workshop summary

LCG and EGEE Operations

• EGEE is funded to operate and support a research grid infrastructure in Europe
• The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service
  – LCG includes US and Asia-Pacific; EGEE includes other sciences
• The LCG Deployment Manager is the EGEE Operations Manager
  – The CERN team (Operations Management Centre) provides coordination, management, and 2nd-level support
• Support activities are divided into three levels
  – Core Infrastructure Centres (CIC) (4)
  – Regional Operations Centres (ROC) (9)
  – Resource Centres (RC, the sites)

Page 7: LCG deployment workshop summary

RCs ROCs CICs

• CICs
  – run services like RBs, information indices, VO/VOMS services, catalogues
  – are the distributed Grid Operations Centre (GOC)
  – 24/7 support!
• ROCs
  – coordinate activities in their region
  – give support to the regional RCs
  – coordinate set-ups/upgrades
  – 24/7 support?
• RCs
  – provide computing and storage
  – 24/7 support???
• The workflow between the different entities is not yet clear

Page 8: LCG deployment workshop summary

Support Teams within LCG & EGEE

(Diagram: the support teams and the classes of problems they own)

• Deployment Support (DS) – middleware problems
• Operations Centre (CIC / GOC / ROC) – operations problems
• Resource Centres (RC) – hardware problems
• Experiment Specific User Support (ESUS) – VO-specific (software) problems
• Global Grid User Support (GGUS) – single point of contact; coordination of user support
• Served communities: the LHC experiments (ALICE, ATLAS, CMS, LHCb) and other communities (VOs), e.g. EGEE and non-LHC experiments (BaBar, CDF, Compass, D0)

Page 9: LCG deployment workshop summary

Operations Working Group Summary

Ian Bird, CERN IT-GD

4 November 2004

Page 10: LCG deployment workshop summary


Issues

• What is the workflow for operations support?
  – Who participates – in which roles?
  – Escalation procedures, agreement to responsibilities/penalties?
  – How to manage small/bad sites?
• What is the daily mode of operations and monitoring?
  – “Opsman” handover procedures etc.
  – Deployment procedures
• What tools are needed to support this?
  – Who will provide them?
• How to approach “24x7” global operations support?
  – How this affects workflow; external collaborations
• What is the interaction/interface to user support?
  – Communication channels?
  – Operations weekly meeting? RSS, IRC, …
• Political-level agreements on accounting/info-gathering granularity
• Milestones (needed for all working groups)
  – A concrete set of reasonable milestones
  – Fit with service challenges; validate the model; monitor the milestones
• Working groups needed for the longer term?

Page 11: LCG deployment workshop summary


Model I Strict Hierarchy (modified)

• A CIC locates a problem with an RC or CIC in a region
  – triggered by monitoring / user alert
• The CIC enters the problem into the problem tracking tool and assigns it to a ROC
• The ROC receives a notification and works on solving the problem
  – The region decides locally what the ROC can do on the RCs
  – This can include restarting services etc.
  – The main emphasis is that the region decides on the depth of the interaction
  – ===> different regions, different procedures
• CICs NEVER contact a site
  – ====> ROCs need to be staffed all the time
  – The ROC is fully responsible for ALL the sites in the region
• Modification: the CIC can contact a site directly and notify the ROC
  – The ROC is responsible for follow-up
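A minimal sketch of this modified hierarchy as a problem-tracking workflow. All names here (Ticket, cic_open_ticket, the example site and region) are illustrative assumptions, not the real PTS schema:

```python
# Illustrative sketch of the modified strict-hierarchy workflow described
# above. The real problem tracking tool (PTS) is not specified in the slides.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Ticket:
    ticket_id: int
    site: str          # the RC where the problem was detected
    region: str        # region -> responsible ROC
    source: str        # "monitoring" or "user alert"
    log: List[str] = field(default_factory=list)
    status: str = "NEW"

def cic_open_ticket(ticket: Ticket) -> None:
    """CIC enters the problem into the PTS and assigns it to the ROC."""
    ticket.status = "ASSIGNED_TO_ROC"
    ticket.log.append(f"CIC: problem at {ticket.site} (via {ticket.source}), "
                      f"assigned to ROC {ticket.region}")

def cic_contact_site(ticket: Ticket) -> None:
    """Modification: the CIC may contact the site directly, but must notify
    the ROC, which remains responsible for the follow-up."""
    ticket.log.append(f"CIC: contacted {ticket.site} directly")
    ticket.log.append(f"CIC: notified ROC {ticket.region} for follow-up")
    ticket.status = "ROC_FOLLOW_UP"

def roc_resolve(ticket: Ticket, action: str) -> None:
    """The region decides locally how deep the ROC intervention goes
    (e.g. restarting services) -- different regions, different procedures."""
    ticket.log.append(f"ROC {ticket.region}: {action} at {ticket.site}")
    ticket.status = "RESOLVED"

t = Ticket(1001, site="ral.ac.uk", region="UK/Ireland", source="monitoring")
cic_open_ticket(t)
roc_resolve(t, "restarted the site GIIS")
print("\n".join(t.log))
```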

Page 12: LCG deployment workshop summary


Support workflow

(Diagram: the GOC tools – GridICE, Gppmon, Site CERT, Gstat – feed a central problem list; each ROC manages the problem list for the RCs (sites) in its region, backed by FAQ and documentation repositories.)

Page 13: LCG deployment workshop summary


Daily ops mode

• CIC-on-duty (scheme described by Lyon)
  – Responsibility rotates through the CICs – one week at a time (see the rota sketch after this list)
  – Manage daily operations – oversee and ensure that:
    • problems from all sources are tracked (entered into the PTS)
    • problems are followed up
    • the CIC-on-duty hands over responsibility for open problems
  – Hand-over happens in the weekly operations meeting
• Daily ops:
  – Checklist
  – Various problem sources: monitors, maps, direct problem reports
  – Need to develop tools to trigger alarms etc.
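A sketch of the weekly rotation arithmetic. The roster and the anchor Monday are assumptions; the slides only say that responsibility rotates through the CICs one week at a time:

```python
# Minimal sketch of a weekly CIC-on-duty rota. The roster below is
# illustrative; the deck only states that there are four CICs.
import datetime

CICS = ["CERN", "RAL", "IN2P3-Lyon", "Taipei"]   # assumed roster
ROTA_EPOCH = datetime.date(2004, 11, 1)          # an arbitrary anchor Monday

def cic_on_duty(day: datetime.date) -> str:
    """Return the CIC on duty for the week containing `day`."""
    weeks_elapsed = (day - ROTA_EPOCH).days // 7
    return CICS[weeks_elapsed % len(CICS)]

# Example: print the hand-over schedule for the next few operations weeks.
for week in range(4):
    monday = ROTA_EPOCH + datetime.timedelta(weeks=week)
    print(f"week of {monday}: {cic_on_duty(monday)} on duty")
```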

Page 14: LCG deployment workshop summary


Escalation procedures

• Need service-level definitions (cf. the Grid3 site charter)
  – What a site supports (apps, software, MPI, compilers, etc.)
  – Levels of support (# admins, hrs/day, on-call, operators, …)
  – Response time to problems
  – Agreement (or not) that remote control is possible (under which conditions)
• Sites sign off on their responsibilities/charter
• Publish sites as “bad” in the information system
  – Based on an unbiased checklist (written by the CICs) – see the sketch after this list
  – Consistently bad sites escalate to the political level (GDB/PMB)
• Small/bad sites
  – Remote management of services
  – Remote fabric monitoring (GridICE etc.)
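A sketch of what an unbiased, checklist-driven “bad site” flag could look like. The checks, thresholds and status fields are invented for illustration; the real checklist would be written by the CICs:

```python
# Toy checklist evaluation: a site failing any check is published as "bad".
from typing import Callable, Dict, List

SiteStatus = Dict[str, object]   # metrics as gathered by monitoring

CHECKLIST: Dict[str, Callable[[SiteStatus], bool]] = {
    "info system publishes":  lambda s: s["giis_up"],
    "job submission works":   lambda s: s["test_job_ok"],
    "responds within SLA":    lambda s: s["response_hours"] <= 24,
    "runs current release":   lambda s: s["release"] == s["current_release"],
}

def evaluate_site(status: SiteStatus) -> List[str]:
    """Return the list of failed checks; an empty list means the site is OK."""
    return [name for name, check in CHECKLIST.items() if not check(status)]

status = {"giis_up": True, "test_job_ok": False,
          "response_hours": 48,
          "release": "LCG-2_2_0", "current_release": "LCG-2_3_0"}
failed = evaluate_site(status)
if failed:
    # Publish the site as "bad" in the information system and, if this
    # persists, escalate to the GDB/PMB as described above.
    print("site flagged bad, failed checks:", ", ".join(failed))
```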

Page 15: LCG deployment workshop summary


Deployment procedures

• How to formally capture site feedback
  – Priorities for the next release, …
• Web page where info is presented
  – What’s in releases, etc.
• How to “force” sites to deploy new releases
  – ROC responsibility
  – Mark the site as “bad”
  – Escalation to the GDB, EGEE PMB

Page 16: LCG deployment workshop summary


Communications

• GDA operations weekly meeting
  – (Grid3 has a daily meeting of service desk + engineers)
  – Could be a model within regions: ROCs + sites
• General news/info page
  – RSS feeds customised for various communities and for general users
• “Run control” – messaging/alarm aggregation – sends messages/notifications to ops consoles
• Use (e.g.) Jabber as a communication tool between CICs (and other operators – ROCs) – see the sketch after this list
• Mailing lists
  – Rollout
  – Announcements (GOC web page – make people look daily)
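A sketch of the Jabber idea: a one-shot alert from a CIC bot to the on-duty operator. It uses the modern slixmpp library purely for illustration (a 2004 deployment would have used a contemporary Jabber client library); the JIDs, password and message are made up:

```python
# Hypothetical CIC-to-operator alert over Jabber/XMPP, using slixmpp.
import slixmpp

class OpsNotifier(slixmpp.ClientXMPP):
    def __init__(self, jid, password, recipient, alert):
        super().__init__(jid, password)
        self.recipient, self.alert = recipient, alert
        self.add_event_handler("session_start", self.start)

    async def start(self, event):
        self.send_presence()
        await self.get_roster()
        # One-shot alert to the on-duty operator, then disconnect.
        self.send_message(mto=self.recipient, mbody=self.alert, mtype="chat")
        self.disconnect()

bot = OpsNotifier("cic-bot@jabber.example.org", "secret",
                  "on-duty@jabber.example.org",
                  "ALARM: site X dropped out of the information system")
bot.connect()
bot.process(forever=False)
```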

Page 17: LCG deployment workshop summary


Tools needed

• GGUS + interfaces to Savannah + local PRMSs
  – (start with Savannah as the central aggregator)
• Monitoring console
  – Monitors (mostly in place now)
  – Frameworks – to allow statistics and the triggering of alarms, notifications, etc. (a minimal sketch follows after this list)
• GSI-enabled sudo (etc.) for remote service management
• Fabric management “cook-book”
• Remote fabric monitors
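A minimal sketch of the kind of framework asked for above: collect monitor samples, keep simple statistics, and trigger a notification when a threshold is crossed. The metric name, window size and threshold are assumptions:

```python
# Toy alarm framework: rolling mean per (site, metric), alarm on threshold.
from collections import defaultdict, deque

WINDOW = 10                               # samples kept per (site, metric)
THRESHOLDS = {"job_failure_rate": 0.5}    # alarm if the mean exceeds this

history = defaultdict(lambda: deque(maxlen=WINDOW))

def notify(message: str) -> None:
    # In production this would go to the ops console / Jabber / e-mail;
    # here we just print.
    print(message)

def record(site: str, metric: str, value: float) -> None:
    """Store a monitor sample and raise an alarm on threshold crossing."""
    samples = history[(site, metric)]
    samples.append(value)
    mean = sum(samples) / len(samples)
    if metric in THRESHOLDS and mean > THRESHOLDS[metric]:
        notify(f"ALARM {site}: {metric} mean {mean:.2f} "
               f"over last {len(samples)} samples")

for v in (0.2, 0.9, 0.8, 0.7):
    record("site-a.example.org", "job_failure_rate", v)
```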

Page 18: LCG deployment workshop summary


24x7 extended support

• Separate security (urgent) from general support
• A distributed CIC provides “24x7” by using EU, Taipei and America – see the sketch after this list
• Real 24x7 coverage only at Tier 0 and Tier 1
  – Or for other specific crucial services that justify the cost
  – Loss of capacity vs. damage
  – Classify what counts as a 24x7 problem
• Direct user support is not needed 24x7
  – Massive failures should be picked up by the operations tools
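A sketch of the follow-the-sun arithmetic behind the distributed CIC. The shift boundaries in UTC are pure assumptions for illustration:

```python
# Coverage follows the sun between Taipei, Europe and America.
SHIFTS_UTC = [           # (start_hour, end_hour, covering region)
    (0, 8,  "Taipei"),
    (8, 16, "Europe"),
    (16, 24, "America"),
]

def covering_region(hour_utc: int) -> str:
    """Return which region covers a given UTC hour."""
    for start, end, region in SHIFTS_UTC:
        if start <= hour_utc < end:
            return region
    raise ValueError("hour must be in 0..23")

assert covering_region(3) == "Taipei"
assert covering_region(10) == "Europe"
assert covering_region(20) == "America"
```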

Page 19: LCG deployment workshop summary


Interfaces to user support

• Same structure as for ops support
• Regional support is needed, but with central aggregation (might need language translation)
  – Need inter-ROC agreement on common formats etc.
• Users are free to submit anywhere (local or global)
  – All in the same PTS, GGUS (ops and user)
• Documentation and example repository needed in a central place
  – Coordination done by the ROC managers
• Still need to clarify workflows and make sure people are in place to do the support
  – GGUS becomes “the” central problem tracker
• Essential to have rapid evolution as we learn the processes

Page 20: LCG deployment workshop summary

Fabric Management WG – CERN, 2-4/11/2004 – Davide Salomoni

Working Group 4: Fabric Management

• Overall goals: look at site experience in the area of fabric management; focus on fabric management/operations in the context of grid services; understand where common problems or weak areas are; try to converge on some best practices. How can we reduce the frequently heard lament that local problems are one of the main reasons for suboptimal grid reliability?

• Tentative list of topics:
  – System installations (LCFGng; Quattor; manual; script-based; what else?). Experience? Which requirements? (E.g.: why do you want/need Quattor? Or manual installation procedures? Or other things? Mix and match?) Which commitments?
  – Batch/scheduling systems; flavours? Work on testing / development / improvements in this area? Documentation/management tools? How do we define/enforce policies? What policies? (Hard limits? Fairshares?)
  – How do we monitor the fabric, and what do we do with the monitoring info? Explore what sites are using / have developed, or what they would like to have and miss. How do we certify/select fabric components? How do we provide status information to fabric users?
  – Fabric set-ups for the grid (shared file systems? MPI support? Parallel FS? Hyperthreading? Bad/good experiences to share? Use cases? How do we manage storage? Networking?)
  – Upgrade procedures: how are they triggered, and how do we cope with them? For example: security/performance patches to clusters; upgrades (or downgrades) to batch systems
  – How do/can we provide fabric-related feedback to LCG/EGEE?
  – How do we disseminate information about fabric/grid issues? (Fabric training for sites? Training on how batch systems are used – for app developers? Do we [want to] publish policies?)
  – Conflicts with other (non-grid / non-LCG) uses of the fabric?
  – Can we identify points of contact for (some of) these topics?
  – How do we keep on discussing these things? Hepix? Some sort of LCG/EGEE-specific forum?
  – ?
• Format
  – For each topic: informal presentations from site operations; discussion; actions, timeline

Page 21: LCG deployment workshop summary


System Installations (1)

• Unanimous desire to have a community-supported way of dealing with system installations
  – Rather than reinventing/discovering solutions independently
• Very strong interest in using Quattor
  – Most sites currently use LCFG, with a couple of sites running mix-and-match systems (LCFG + scripts – not the ones presented on Tuesday by Laurence/Louis)
  – 4 sites (including CERN) have already installed Quattor servers
• Key point: provide at the very least installation and configuration of Worker Nodes (a toy desired-state sketch follows below)
• Installation reports are being written (NIKHEF, RAL, CNAF)
  – Pros
    • Modularity, possibility of switching individual components on/off
    • Load balancing, scalability
  – Cons
    • Perfectible installation guide; perceived complexity in setting the system up
    • Several components have already been written at CERN, but there is no clear publicity, nor testing outside of CERN (example: the afs component)
    • LCG components are also available (Cal Loomis), but, again, with limited documentation
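Quattor itself is driven by declarative node profiles (written in its Pan language). As a language-neutral illustration of that desired-state idea, here is a toy drift check between what a Worker Node profile prescribes and what the node reports; the package names and versions are made up:

```python
# Toy desired-state check, illustrating what a fabric-management tool like
# Quattor automates at scale. Not Quattor's actual profile format.
desired_profile = {          # what the Worker Node profile prescribes
    "kernel": "2.4.21-20", "openssl": "0.9.7a", "lcg-wn": "2.3.0",
}

installed = {                # what the node actually reports
    "kernel": "2.4.21-15", "openssl": "0.9.7a",
}

def drift(desired: dict, actual: dict) -> dict:
    """Map each out-of-sync package to a (wanted, found) pair."""
    return {pkg: (ver, actual.get(pkg, "MISSING"))
            for pkg, ver in desired.items()
            if actual.get(pkg) != ver}

for pkg, (wanted, found) in drift(desired_profile, installed).items():
    print(f"{pkg}: wanted {wanted}, found {found}")
```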

Page 22: LCG deployment workshop summary


System Installations (2)

• Positioning: generic installation procedures (formerly: “manual installations”) are not competing with Quattor
  – Rather, we expect to build on top of the generic installation scripts
• Resources
  – Within November we’ll have a formal “Quattor WG” under the auspices of the GDB
  – SA1 people can officially work on Quattor (i.e., account for time spent on Quattor in the EGEE timesheets). There is no “free lunch” here either; we expect the formation of a virtuous circle, with this WG helping with the configuration and support of Quattor components.
• Actions
  – We need not wait for the formation of the Quattor WG to start doing something
  – Provide details on Quattor scalability (comparable to the analogous work done in EDG times for LCFG) and resilience, with a How-To – German Cancio
  – Better document which components are available at CERN, and encourage their usage outside CERN – German Cancio
  – Organize Quattor training for interested parties – CERN (around December)

Page 23: LCG deployment workshop summary


Fabric Monitoring

• Very diverse picture; all the usual suspects are present: Nagios, Ganglia, GridICE, LEMON, and a variety of self-made scripts
  – A complaint: there are too many monitoring systems. A suggestion: this needs to be discussed in detail within SA1.
  – But most sites also say that they plan to invest time and resources in finding a more comprehensive monitoring system (so, potentially, we’ll have even more monitoring systems)
  – Most of these systems focus on hardware-related metrics (CPU temperature, fan rotation speed, etc.)
  – Very worrying: only 2 sites actually have systems that proactively try to deal with “black holes”
    • On the other hand, several sites have experienced black holes and rely on manual intervention following automatic notification mechanisms (best case) or user notifications (worst case). This can lead to varying degrees of unreliability.
• Actions
  – Create a subgroup to document which tests should be made at the fabric level to verify that the WNs and the batch system are in good order (a detection sketch follows after this list)
    • Try out Piotr’s script locally
    • Validate and integrate existing scripts
    • Weizmann, RAL, CNAF, IN2P3, NIKHEF, CERN (German Cancio)

Page 24: LCG deployment workshop summary


Some Farm Set-up Issues (1)

• Hyper-Threading
  – Most sites have it disabled
  – One site (CNAF) enabled it after an explicit request from an experiment. But this does not really scale or work if:
    • you have a multi-purpose centre, or fairshare allocations (rather than statically defined sub-clusters)
    • you do not have properly configured nodes (for example, enough memory to avoid resource contention, enough jobs running on nodes to avoid suboptimal CPU usage, a suitable OS)
    • you have conflicting requests from experiments
  – A difficulty is that experiments do not seem interested in benchmarking their applications (is this feasible at all?)
  – Disk servers supposedly benefit from HT
  – Actions
    • Gather existing HT data, and perform some tests – CNAF, IN2P3, RAL

Page 25: LCG deployment workshop summary


Some Farm Set-up Issues (2)

• What do you specify as technical requirements in procurement tenders?
  – Thermal properties (horror stories heard)
• Serial-ATA vs. SCSI disks: is the price difference justified? This is an example of a topic being discussed in several places (CHEP, Hepix), and we need to channel the discussions into this forum/WG
• Operational fabric security: a site would like to have an up-to-date list of IP networks to which to allow outbound connectivity (to reduce the impact of a DoS attack). A topic for WG1 (operational security)? See the sketch after this list
• Survey of common use cases for farm set-ups: a review document is being written in SE Asia (to be out in approx. 1 week) – All
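A sketch of the outbound-connectivity idea: generating firewall rules from a maintained list of allowed networks. The network list is made up, and the iptables invocation is one possible realisation, not a prescribed procedure:

```python
# Toy generator of outbound-allow rules from a list of permitted networks.
import ipaddress

ALLOWED_NETWORKS = [         # would be fetched from a maintained grid list
    "192.0.2.0/24",          # e.g. a Tier-1's storage network (made up)
    "198.51.100.0/24",       # e.g. RB / information system hosts (made up)
]

def iptables_rules(networks):
    """Yield rules allowing outbound traffic to each network, then a drop."""
    for net in networks:
        ipaddress.ip_network(net)   # validate before emitting the rule
        yield f"iptables -A OUTPUT -d {net} -j ACCEPT"
    yield "iptables -A OUTPUT -j DROP"

for rule in iptables_rules(ALLOWED_NETWORKS):
    print(rule)
```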

Page 26: LCG deployment workshop summary


Communication Channels

• We would like to have a single point of contact for grid-sysadmin-related topics, but the format for this is not agreed yet
  – A web page/portal? Who is going to maintain it?
  – Yet another suggestion to SA1?
  – LCG-ROLLOUT is too generic
  – Do we need/want another mailing list?
• There are topics we would have liked to discuss, and did not have the time
  – Should getting together in a similar workshop be repeated [more frequently]? Should we wait 6 months for the Hepix meeting in Karlsruhe?
  – We should report on what we discuss and do in the context of this WG’s activities at Hepix
  – An item for discussion in the GDB? How is WG2 (operational support) going to address this?