LCG-2 Operational Experience in Amsterdam
Davide Salomoni
NIKHEF
GDB, 13 October 2004 – Amsterdam
Talk Outline
• The Dutch Tier-1 Center
• Resource usage, monitoring
• User/grid support
NL Tier-1
• NIKHEF
– 140 WNs (280 CPUs)
– Around 4+ TB of disk space
– The farm is fairly heterogeneous, having been built over time
– 3 people
• SARA
– 30 WNs (60 CPUs)
– Homogeneous farm (dual-Xeon 3.06 GHz) in Almere
– TERAS (SGI Origin 3800 with a total of 1024 CPUs) as SE, with automatic migration to tape (capacity 1.2 PB) and a disk front-end cache of (currently) 500 GB
– Second tape library TBI soon with 250 TB – can grow up to 60 PB
– 5 people
Farm Monitoring
• Both NIKHEF and SARA use ganglia
– With some extensions, e.g. a ganglia/pbs interface
• Several stats are available for both admin and user consumption, for example:
– The usual Ganglia pages
– Job details, per-experiment farm usage
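The talk does not show the ganglia/pbs interface itself; a minimal sketch of the idea, assuming torque's `qstat -a` output format (job state in the 10th column) and the standard ganglia `gmetric` tool. The function names and the dry-run `echo` wrapper are illustrative, not the actual NIKHEF extension:

```shell
#!/bin/sh
# Hypothetical sketch of a ganglia/pbs interface: count jobs per state
# from "qstat -a" output and publish the counts via gmetric.

# count_jobs <state-letter>   (qstat -a output on stdin)
# Skips the 5 header lines; the job state is the 10th column.
count_jobs() {
    awk -v s="$1" 'NR > 5 && $10 == s { n++ } END { print n + 0 }'
}

# publish <metric-name> <value> -- dry-run; drop the `echo`
# to really inject the metric into ganglia.
publish() {
    echo gmetric --name "$1" --value "$2" --type uint32
}
```

Usage would be something like `publish pbs_running "$(qstat -a | count_jobs R)"` from cron on the pbs server.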
Use of Resources
• torque/maui as batch system/scheduler
– Much better than OpenPBS
• Custom RPMs
– Mainly to support the transient $TMPDIR patch (automatic removal of temporary directories upon job completion) – implements a feature present in PBSPro and other batch systems
– See http://www.dutchgrid.nl/install/edg-testbed-stuff/torque/ and http://www.dutchgrid.nl/install/edg-testbed-stuff/maui/
• Extensive use of maui's fairshare mechanism to set targets for (grid and local) users, groups, classes, etc. Sample config at http://www.dutchgrid.nl/install/edg-testbed-stuff/maui/
– Flexible, and complex
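The actual sample config lives at the dutchgrid URL above; as a flavour of what such a fairshare setup looks like, here is an illustrative maui.cfg fragment (all targets and group names invented for this sketch):

```
# Illustrative maui.cfg fairshare fragment -- values are invented,
# the real NIKHEF sample is at the dutchgrid URL above.
FSPOLICY         DEDICATEDPS    # charge fairshare by dedicated proc-seconds
FSDEPTH          7              # keep 7 fairshare windows...
FSINTERVAL       24:00:00       # ...of one day each
FSDECAY          0.80           # older windows count for less

GROUPCFG[atlas]  FSTARGET=30    # aim at 30% of the farm for this group
GROUPCFG[lhcb]   FSTARGET=20
USERCFG[DEFAULT] FSTARGET=5-    # cap any single user at 5%
```

The "flexible, and complex" remark is apt: targets, caps and floors interact with the decay window, so changes are best tried one parameter at a time.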
Use of Resources (2)
• Check fairshare usage: aggregate use of the NDPF from 2004-09-01 to 2004-09-23 inclusive, for all users that are in groups XXXX.

Date          CPU time     WallTime    GHzHours  #jobs
2004-09-04    00:00:00     00:00:10        0.00      6
2004-09-06    49:38:00     49:41:26      127.61     10
2004-09-07   155:32:36    159:15:56      388.77      9
2004-09-08   559:31:19    579:12:23     1336.88     14
2004-09-09   523:15:21    524:14:17     1202.94     25
2004-09-10  1609:29:32   1617:20:42     3685.88     89
2004-09-11   319:18:39    331:14:29      662.48     13
2004-09-12    96:58:59     97:24:11      194.81      2
2004-09-13   131:43:08    133:06:45      266.23      6
2004-09-14   214:41:10    215:44:00      431.47     11
2004-09-15    59:56:58     65:24:52      130.83      5
2004-09-16    38:50:30     39:06:36       78.22      3
2004-09-17   432:55:49    452:22:26      938.97      6
2004-09-18    95:35:22     96:00:23      192.01      1
2004-09-19    95:26:31     96:00:17      192.01      1
2004-09-20    10:09:34     10:17:38       20.59     22
2004-09-21    49:06:40     49:45:10       99.51      3
2004-09-22    88:14:41     88:37:06      177.24      2
2004-09-23   184:45:49    214:44:09      429.47      3
Summed      4715:10:38   4819:32:56    10555.91    231
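The reporting script behind this table is not shown in the talk; a sketch of the aggregation step, assuming per-day report lines in the format above (date, CPU time, walltime, GHzHours, job count):

```shell
#!/bin/sh
# Sketch of the aggregation behind a fairshare usage report: sum per-day
# "date  HH:MM:SS  HH:MM:SS  ghzhours  njobs" lines into a totals row.
# (Input format assumed from the report layout; the real script differs.)

sum_report() {
    awk '
        function secs(t,  a) { split(t, a, ":"); return a[1]*3600 + a[2]*60 + a[3] }
        function hms(s) { return sprintf("%d:%02d:%02d", int(s/3600), int(s%3600/60), s%60) }
        { cpu += secs($2); wall += secs($3); ghz += $4; jobs += $5 }
        END { printf "Summed %s %s %.2f %d\n", hms(cpu), hms(wall), ghz, jobs }
    '
}
```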
Babysitting the (local) Grid
• A number of home-built scripts try to keep the system under control
– Check for unusually short wallclock times repeated in short succession on the same node(s) – often an indication of a black hole
– Check that nodes have a standard set of open ports (e.g. ssh, nfs, ganglia); if not, take them out of the batch system
– Periodically remove old (stale) state files, lest they be taken into account by the job manager (a noticeable burden on the pbs server in that case)
– Monitor the pbs log and accounting files in various ways to check for errors and abnormal conditions
– Cache the output of various pbs utilities to work around the many (sometimes 30/sec) unnecessary queries coming from the job manager(s)
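These scripts are home-built and not shown in the talk; a sketch of the first check (black-hole detection), assuming torque accounting-log "E" (job end) records whose fourth semicolon-separated field carries space-separated `key=value` attributes such as `exec_host=` and `resources_used.walltime=`. Thresholds and names are illustrative:

```shell
#!/bin/sh
# Sketch of a black-hole check: flag any node that completed several jobs
# with a very short wallclock time (jobs that die instantly and drain the
# queue). The real NIKHEF scripts are home-built and not published here.

# find_black_holes <max_walltime_seconds> <min_job_count>
# Reads torque accounting records on stdin, prints suspect node names.
find_black_holes() {
    awk -v lim="$1" -v min="$2" -F';' '
        $2 != "E" { next }                       # only job-end records
        {
            node = ""; wall = ""
            n = split($4, kv, " ")
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^exec_host=/) {
                    node = substr(kv[i], 11)     # strip "exec_host="
                    sub(/\/.*/, "", node)        # keep host, drop /cpu
                }
                if (kv[i] ~ /^resources_used.walltime=/) {
                    split(substr(kv[i], 25), t, ":")
                    wall = t[1]*3600 + t[2]*60 + t[3]
                }
            }
            if (node != "" && wall != "" && wall + 0 < lim) short[node]++
        }
        END { for (node in short) if (short[node] >= min) print node }
    '
}
```

A suspect node would then be drained (e.g. `pbsnodes -o <node>`) pending inspection.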
On Being Monitored
• We certainly want to be on this map
• But there seem to be too many testing scripts
• Two main problems:
– Given the way the existing job manager works, current monitoring can create disruption on busy sites. David Smith promised to change some important things in this area.
– GOC polling seems (by default) too frequent (see recent mail from Dave Kant on LCG-ROLLOUT)
» "we can [submit] fewer jobs to you [… but this] is only a temporary measure" (DK). We probably need to work out a different strategy.
Grid Software
• Not always clear when new releases are due, or with which upgrade policies
• Configuration management sometimes getting worse:
– Due to schema limitations, we now need one queue per experiment. But this has to be done manually and with some care (or your GRIS risks losing some info – see ce-static.ldif and ceinfo-wrapper.sh)
– The LCG [software, infrastructure] is not used only by LHC experiments, although this seems to be assumed here and there (proliferation of environment variables [e.g. WN_DEFAULT_SE_exp]; some default files assume you want to support particular VOs [e.g. SIXT]). Nothing dramatic, of course. But for a Tier-X center supporting VOs other than those of the canonical 4 experiments, this can lead to time/effort inefficiencies.
Grid Support
• There are too many points of contact here and there, and they often seem not very well correlated
– LCG GOC @ RAL: http://goc.grid-support.ac.uk/gridsite/gocmain/
– Testzone page at CERN: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi
» Useful, but it often says that a site failed a test, and the next time the site is OK. Furthermore, no actions seem to be taken anymore when a site fails.
– GridICE @ RAL: http://grid-ice.esc.rl.ac.uk/gridice/site/site.php
» Most sites do not seem to be tracked, though, and I am not sure that the published numbers reflect reality (they don't for NIKHEF, for example)
– FAQs abound (but are too uncorrelated):
» GOC FAQ @ RAL: http://www.gridpp.ac.uk/tb-support/faq/
» GOC Wiki FAQ @ Taipei: http://goc.grid.sinica.edu.tw/gocwiki/TroubleShootingHistory
» GGUS FAQ @ Karlsruhe: https://gus.fzk.de/ (seems tailored to GridKA)
– LCG-ROLLOUT seems a good troubleshooting forum
– GGUS
Grid Support (2)
• We have, like many, the usual set of web pages to support users. See e.g. http://www.dutchgrid.nl
– Yes, we also have our own FAQ list
– Ticketing system, telephone, email contact, etc.
• Grid tutorials are quite popular (one is running today)
• Common problems:
– Users: access to the UI, how to run jobs, how to access resources (e.g. store/retrieve data)
– Support for specific packages (not every VO uses lcg-ManageSoftware/lcg-ManageVO), suggestions on how best (not) to use grid resources
– The mandatory hardware problems
– Batch system fun
– Firewall considerations (examples: number of streams in edg-rm, SITE_GLOBUS_TCP_RANGE for WNs)
• Supported VOs include alice, cms, esr, ncf, lhcb, atlas, dteam, dzero, pvier, astron, astrop, tutor, vle, asci, nadc, magic[, biomed]
– Some scripts developed to ease the addition of new VOs
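The VO-addition scripts mentioned above are not shown; a dry-run sketch in their spirit, which only prints the torque qmgr commands and maui.cfg line a new VO queue would need. The function name, walltime limit, and one-queue-per-VO convention are assumptions for this sketch:

```shell
#!/bin/sh
# Hypothetical "new VO" helper, in the spirit of the scripts mentioned
# above: print (not execute) the torque and maui changes for one new VO.

# add_vo <vo-name> <fairshare-target-percent>
add_vo() {
    vo=$1; fs=$2
    cat <<EOF
qmgr -c "create queue $vo"
qmgr -c "set queue $vo queue_type = Execution"
qmgr -c "set queue $vo resources_max.walltime = 48:00:00"
qmgr -c "set queue $vo acl_group_enable = True"
qmgr -c "set queue $vo acl_groups = $vo"
qmgr -c "set queue $vo enabled = True"
qmgr -c "set queue $vo started = True"
# append to maui.cfg:
GROUPCFG[$vo] FSTARGET=$fs
EOF
}
```

Keeping it dry-run lets an admin review the output before piping it to a shell; pool accounts and grid-mapfile entries would still need to be handled separately.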
What Now?
• EGEE, of course
– Or not?
• Quattor
– As Jeff says, the speed at which we react to installation/change requests is proportional to E^-n, where E is the effort required and n some number > 1 [n being higher if any perl is involved]
• Improve proactive monitoring and batch system efficiency; develop experience with, and tune, high-speed [real] data transfers
TERAS Storage Element
• SGI Origin 3800
– 32x R14k MIPS 500 MHz, 1 GB/processor
– TERAS interactive node and Grid SE
– Mass storage environment
» SGI TP9100, 14 TB RAID5, FC-based
» CXFS SAN shared file system
   – Home file systems
   – Grid SE file system = Robust Data Challenge FS (max 400 MB/s)
   – Batch scratch file systems
» DMF/TMF Hierarchical Storage Management
   – Transparent data migration to tape
   – Home file systems
   – Grid SE file system
SAN Amsterdam
External Network Connectivity