LCG-2 Operational Experience in Amsterdam
Davide Salomoni
NIKHEF
GDB, 13 October 2004 – Amsterdam
Talk Outline
• The Dutch Tier-1 Center
• Resource usage, monitoring
• User/grid support
NL Tier-1
• NIKHEF
– 140 WNs (280 CPUs)
– Around 4+ TB of disk space
– The farm is fairly heterogeneous, having been built over time
– 3 people
• SARA
– 30 WNs (60 CPUs)
– Homogeneous farm (dual-Xeon 3.06 GHz) in Almere
– TERAS (SGI Origin 3800 with a total of 1024 CPUs) as SE, with automatic migration to tape (capacity 1.2 PB) and a disk front-end cache of (currently) 500 GB
– Second tape library TBI soon with 250 TB – can grow up to 60 PB
– 5 people
Farm Monitoring
• Both NIKHEF and SARA use ganglia
– With some extensions, e.g. a ganglia/pbs interface
• Several stats are available for both admin and user consumption, for example:
– The usual Ganglia pages
– Job details, per-experiment farm usage
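The talk does not show the ganglia/pbs interface itself; a minimal sketch of the idea, assuming torque's `qstat -a` output format (job state in the 10th column) and the standard ganglia `gmetric` tool. The function names and the dry-run `echo` wrapper are illustrative, not the actual NIKHEF extension:

```shell
#!/bin/sh
# Hypothetical sketch of a ganglia/pbs interface: count jobs per state
# from "qstat -a" output and publish the counts via gmetric.

# count_jobs <state-letter>   (qstat -a output on stdin)
# Skips the 5 header lines; the job state is the 10th column.
count_jobs() {
    awk -v s="$1" 'NR > 5 && $10 == s { n++ } END { print n + 0 }'
}

# publish <metric-name> <value> -- dry-run; drop the `echo`
# to really inject the metric into ganglia.
publish() {
    echo gmetric --name "$1" --value "$2" --type uint32
}
```

Usage would be something like `publish pbs_running "$(qstat -a | count_jobs R)"` from cron on the pbs server.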
Use of Resources
• torque/maui as batch system/scheduler
– Much better than OpenPBS
• Custom RPMs
– Mainly to support the transient $TMPDIR patch (automatic removal of temporary directories upon job completion) – implements a feature present in PBSPro and other batch systems
– See http://www.dutchgrid.nl/install/edg-testbed-stuff/torque/ and http://www.dutchgrid.nl/install/edg-testbed-stuff/maui/
• Extensive use of maui's fairshare mechanism to set targets for (grid and local) users, groups, classes, etc. Sample config at http://www.dutchgrid.nl/install/edg-testbed-stuff/maui/
– Flexible, and complex
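The actual sample config lives at the dutchgrid URL above; as a flavour of what such a fairshare setup looks like, here is an illustrative maui.cfg fragment (all targets and group names invented for this sketch):

```
# Illustrative maui.cfg fairshare fragment -- values are invented,
# the real NIKHEF sample is at the dutchgrid URL above.
FSPOLICY         DEDICATEDPS    # charge fairshare by dedicated proc-seconds
FSDEPTH          7              # keep 7 fairshare windows...
FSINTERVAL       24:00:00       # ...of one day each
FSDECAY          0.80           # older windows count for less

GROUPCFG[atlas]  FSTARGET=30    # aim at 30% of the farm for this group
GROUPCFG[lhcb]   FSTARGET=20
USERCFG[DEFAULT] FSTARGET=5-    # cap any single user at 5%
```

The "flexible, and complex" remark is apt: targets, caps and floors interact with the decay window, so changes are best tried one parameter at a time.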
Use of Resources (2)
• Check fairshare usage: aggregate use of the NDPF from 2004-09-01 to 2004-09-23 inclusive, for all users that are in groups XXXX.

Date          CPU time     WallTime    GHzHours  #jobs
2004-09-04    00:00:00     00:00:10        0.00      6
2004-09-06    49:38:00     49:41:26      127.61     10
2004-09-07   155:32:36    159:15:56      388.77      9
2004-09-08   559:31:19    579:12:23     1336.88     14
2004-09-09   523:15:21    524:14:17     1202.94     25
2004-09-10  1609:29:32   1617:20:42     3685.88     89
2004-09-11   319:18:39    331:14:29      662.48     13
2004-09-12    96:58:59     97:24:11      194.81      2
2004-09-13   131:43:08    133:06:45      266.23      6
2004-09-14   214:41:10    215:44:00      431.47     11
2004-09-15    59:56:58     65:24:52      130.83      5
2004-09-16    38:50:30     39:06:36       78.22      3
2004-09-17   432:55:49    452:22:26      938.97      6
2004-09-18    95:35:22     96:00:23      192.01      1
2004-09-19    95:26:31     96:00:17      192.01      1
2004-09-20    10:09:34     10:17:38       20.59     22
2004-09-21    49:06:40     49:45:10       99.51      3
2004-09-22    88:14:41     88:37:06      177.24      2
2004-09-23   184:45:49    214:44:09      429.47      3
Summed      4715:10:38   4819:32:56    10555.91    231
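The reporting script behind this table is not shown in the talk; a sketch of the aggregation step, assuming per-day report lines in the format above (date, CPU time, walltime, GHzHours, job count):

```shell
#!/bin/sh
# Sketch of the aggregation behind a fairshare usage report: sum per-day
# "date  HH:MM:SS  HH:MM:SS  ghzhours  njobs" lines into a totals row.
# (Input format assumed from the report layout; the real script differs.)

sum_report() {
    awk '
        function secs(t,  a) { split(t, a, ":"); return a[1]*3600 + a[2]*60 + a[3] }
        function hms(s) { return sprintf("%d:%02d:%02d", int(s/3600), int(s%3600/60), s%60) }
        { cpu += secs($2); wall += secs($3); ghz += $4; jobs += $5 }
        END { printf "Summed %s %s %.2f %d\n", hms(cpu), hms(wall), ghz, jobs }
    '
}
```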
Babysitting the (local) Grid
• A number of home-built scripts try to keep the system under control
– Check for unusually short wallclock times repeated in short succession on the same node(s) – often an indication of a black hole
– Check that nodes have a standard set of open ports (e.g. ssh, nfs, ganglia); if not, take them out of the batch system
– Periodically remove old (stale) state files, lest they be taken into account by the job manager (a noticeable burden on the pbs server in that case)
– Monitor the pbs log and accounting files in various ways to check for errors and abnormal conditions
– Cache the output of various pbs utilities to work around the many (sometimes 30/sec) unnecessary queries coming from the job manager(s)
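These scripts are home-built and not shown in the talk; a sketch of the first check (black-hole detection), assuming torque accounting-log "E" (job end) records whose fourth semicolon-separated field carries space-separated `key=value` attributes such as `exec_host=` and `resources_used.walltime=`. Thresholds and names are illustrative:

```shell
#!/bin/sh
# Sketch of a black-hole check: flag any node that completed several jobs
# with a very short wallclock time (jobs that die instantly and drain the
# queue). The real NIKHEF scripts are home-built and not published here.

# find_black_holes <max_walltime_seconds> <min_job_count>
# Reads torque accounting records on stdin, prints suspect node names.
find_black_holes() {
    awk -v lim="$1" -v min="$2" -F';' '
        $2 != "E" { next }                       # only job-end records
        {
            node = ""; wall = ""
            n = split($4, kv, " ")
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^exec_host=/) {
                    node = substr(kv[i], 11)     # strip "exec_host="
                    sub(/\/.*/, "", node)        # keep host, drop /cpu
                }
                if (kv[i] ~ /^resources_used.walltime=/) {
                    split(substr(kv[i], 25), t, ":")
                    wall = t[1]*3600 + t[2]*60 + t[3]
                }
            }
            if (node != "" && wall != "" && wall + 0 < lim) short[node]++
        }
        END { for (node in short) if (short[node] >= min) print node }
    '
}
```

A suspect node would then be drained (e.g. `pbsnodes -o <node>`) pending inspection.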
On Being Monitored
• We certainly want to be on this map
• But there seem to be too many testing scripts
• Two main problems:
– Given the way the existing job manager works, current monitoring can create disruption on busy sites. David Smith promised to change some important things in this area.
– GOC polling seems (by default) too frequent (see recent mail from Dave Kant on LCG-ROLLOUT)
» "we can [submit] fewer jobs to you [… but this] is only a temporary measure" (DK). We probably need to work out a different strategy.
Grid Software
• Not always clear when new releases are due, or with which upgrade policies
• Configuration management sometimes getting worse:
– Due to schema limitations, we now need one queue per experiment. But this has to be done manually and with some care (or your GRIS risks losing some info – see ce-static.ldif and ceinfo-wrapper.sh)
– The LCG [software, infrastructure] is not used only by LHC experiments, although this seems to be assumed here and there (proliferation of environment variables [e.g. WN_DEFAULT_SE_exp]; some default files assume you want to support particular VOs [e.g. SIXT]). Nothing dramatic, of course. But for a Tier-X center supporting VOs other than those of the canonical 4 experiments, this can lead to time/effort inefficiencies.
Grid Support
• There are too many points of contact here and there, and they often seem not very well correlated
– LCG GOC @ RAL: http://goc.grid-support.ac.uk/gridsite/gocmain/
– Testzone page at CERN: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi
» Useful, but it often says that a site failed a test, and the next time the site is OK. Furthermore, no actions seem to be taken anymore when a site fails.
– GridICE @ RAL: http://grid-ice.esc.rl.ac.uk/gridice/site/site.php
» Most sites do not seem to be tracked, though, and I am not sure that the published numbers reflect reality (they don't for NIKHEF, for example)
– FAQs abound (but are too uncorrelated):
» GOC FAQ @ RAL: http://www.gridpp.ac.uk/tb-support/faq/
» GOC Wiki FAQ @ Taipei: http://goc.grid.sinica.edu.tw/gocwiki/TroubleShootingHistory
» GGUS FAQ @ Karlsruhe: https://gus.fzk.de/ (seems tailored to GridKA)
– LCG-ROLLOUT seems a good troubleshooting forum
– GGUS
Grid Support (2)
• We have, like many, the usual set of web pages to support users. See e.g. http://www.dutchgrid.nl
– Yes, we also have our own FAQ list
– Ticketing system, telephone, email contact, etc.
• Grid tutorials are quite popular (one is running today)
• Common problems:
– Users: access to the UI, how to run jobs, how to access resources (e.g. store/retrieve data)
– Support for specific packages (not every VO uses lcg-ManageSoftware/lcg-ManageVO), suggestions on how best (not) to use grid resources
– The mandatory hardware problems
– Batch system fun
– Firewall considerations (examples: number of streams in edg-rm, SITE_GLOBUS_TCP_RANGE for WNs)
• Supported VOs include alice, cms, esr, ncf, lhcb, atlas, dteam, dzero, pvier, astron, astrop, tutor, vle, asci, nadc, magic[, biomed]
– Some scripts developed to ease the addition of new VOs
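The VO-addition scripts mentioned above are not shown; a dry-run sketch in their spirit, which only prints the torque qmgr commands and maui.cfg line a new VO queue would need. The function name, walltime limit, and one-queue-per-VO convention are assumptions for this sketch:

```shell
#!/bin/sh
# Hypothetical "new VO" helper, in the spirit of the scripts mentioned
# above: print (not execute) the torque and maui changes for one new VO.

# add_vo <vo-name> <fairshare-target-percent>
add_vo() {
    vo=$1; fs=$2
    cat <<EOF
qmgr -c "create queue $vo"
qmgr -c "set queue $vo queue_type = Execution"
qmgr -c "set queue $vo resources_max.walltime = 48:00:00"
qmgr -c "set queue $vo acl_group_enable = True"
qmgr -c "set queue $vo acl_groups = $vo"
qmgr -c "set queue $vo enabled = True"
qmgr -c "set queue $vo started = True"
# append to maui.cfg:
GROUPCFG[$vo] FSTARGET=$fs
EOF
}
```

Keeping it dry-run lets an admin review the output before piping it to a shell; pool accounts and grid-mapfile entries would still need to be handled separately.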
What Now?
• EGEE, of course
– Or not?
• Quattor
– As Jeff says, the speed at which we react to installation/change requests is proportional to E^-n, where E is the effort required and n some number > 1 [n being higher if any perl is involved]
• Improve proactive monitoring and batch system efficiency; develop experience with, and tune, high-speed [real] data transfers
TERAS Storage Element
• SGI Origin 3800
– 32x R14k MIPS 500 MHz, 1 GB/processor
– TERAS interactive node and Grid SE
– Mass storage environment
» SGI TP9100, 14 TB RAID5, FC-based
» CXFS SAN shared file system
   – Home file systems
   – Grid SE file system = Robust Data Challenge FS (max 400 MB/s)
   – Batch scratch file systems
» DMF/TMF Hierarchical Storage Management
   – Transparent data migration to tape
   – Home file systems
   – Grid SE file system
SAN Amsterdam
External Network Connectivity