Tier1A Status, Andrew Sansum, GridPP 8, 23 September 2003
TRANSCRIPT
EDG Status (1) (Steve Traylen)
• EDG 2.0.x deployed on production testbed since early September. Provides:
– EDG R-GMA info catalogue: http://gppic06.gridpp.rl.ac.uk:8080/R-GMA/
– RLS for lhcb, biom, eo, wpsix, tutor and babar
• EDG 2.1 deployed on the dev testbed. VOMS integration work underway. May prove useful to small GridPP experiments (e.g. NA48, MICE and MINOS)
EDG Status (2)
• The EDG 1.4 gatekeeper continues to provide the gateway into the main CSF production farm. Provides access for a small amount of BaBar and ATLAS work. Being prepared for forthcoming D0 production via SAMGrid.
• Along with IN2P3, CSFUI provides the main UI for EDG.
• Many WP3 and WP5 mini testbeds
• Further Grid integration into the production farm via LCG – not EDG
LCG Integration (M. Bly)
• LCG 0 mini testbed deployed in July
• LCG 0 upgraded to LCG 1 in September. Consists of:
– Lcgwst regional GIIS
– RB
– CE, SE, UI, BDII, PROXY
– Five worker nodes
• Soon need to make important decisions about how much hardware to deploy into LCG – whatever the experiments/EB want.
LCG Experience
• Mainly known issues:
– Installation and configuration still difficult for non-experts.
– Documentation still thin in many places.
– Support often very helpful, but answers not always forthcoming for some problems.
– Not everything works all of the time.
• Beginning to discuss internally how to interoperate with production farm.
SRB Service For CMS
• Considerable learning experience for Datastore team (and CMS)!
• SRB MCAT for the whole CMS production. Consists of enterprise-class Oracle servers and a “thin” MCAT Oracle client.
• SRB interface into the Datastore
• SRB-enabled disk server to handle data imports
• SRB clients on disk servers for data moving
New Hardware (March)
• 80 dual-processor 2.66 GHz P4 Xeon nodes
• 11 disk servers: 40 TB IDE disk
– 11 dual-P4 servers (with PCI-X), each with 2 Infortrend IFT-6300 arrays
– 12 Maxtor 200 GB DiamondMax Plus 9 drives per array
• Major Datastore upgrade over summer
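The disk figures above can be sanity-checked with simple arithmetic. The RAID 5 layout assumed below is a guess, since the slides only quote 40 TB usable across the 11 servers; the gap between the RAID 5 figure and 40 TB would be filesystem, hot-spare and decimal/binary overheads.

```python
# Sanity check of the raw IDE capacity behind the quoted 40 TB figure.
# Assumption: RAID 5 across each 12-drive array (one drive's worth of parity).
servers = 11
arrays_per_server = 2
drives_per_array = 12
drive_gb = 200

raw_gb = servers * arrays_per_server * drives_per_array * drive_gb
usable_gb = servers * arrays_per_server * (drives_per_array - 1) * drive_gb

print(raw_gb / 1000, "TB raw")          # 52.8 TB raw
print(usable_gb / 1000, "TB after RAID 5")  # 48.4 TB before other overheads
```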
P4 Operation Problematic
• Disappointing performance with gcc:
– Hoped for 2.66 GHz P4 / 1.4 GHz P3 = 1.5
– Seeing 1.2–1.3
• Can obtain more by exploiting hyper-threading, but Linux CPU scheduling causes difficulties (ping-pong effects)
• CPU accounting now depends on number of jobs running.
• Beginning to look closely at Opteron solutions.
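One hedged reading of the accounting bullet above: with hyper-threading, raw CPU seconds overstate the work a job actually did when two jobs share one physical processor, so accounted time has to be rescaled by the job count. The 1.3x combined-throughput figure below is an illustrative assumption, not a measured value from the talk.

```python
# Sketch of CPU-time normalisation when jobs share a hyper-threaded CPU.
# ht_throughput is the assumed combined throughput of two jobs on one
# physical CPU relative to a single job (1.3x is illustrative only).
def normalised_cpu(cpu_seconds, jobs_on_cpu, ht_throughput=1.3):
    """Scale raw CPU seconds to reflect actual work done."""
    if jobs_on_cpu <= 1:
        return cpu_seconds
    return cpu_seconds * ht_throughput / jobs_on_cpu

print(normalised_cpu(3600, 1))  # 3600
print(normalised_cpu(3600, 2))  # 2340.0
```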
Datastore Upgrade
• STK 9310 robot, 6000 slots
– IBM 3590 drives being phased out (10 GB, 10 MB/s)
– STK 9940B drives in production (200 GB, 30 MB/s)
• 4 IBM 610+ servers with two FC connections and Gbit networking on PCI-X
– 9940 drives FC-connected via 2 switches for redundancy
– SCSI RAID 5 disk with hot spare for 1.2 TB cache space
[Diagram: Datastore layout. Four RS6000 servers, each with fsc0/fsc1 Fibre Channel adapters, connect through Switch_1 and Switch_2 to eight 9940B drives (rmt1–rmt8) in the STK 9310 “Powder Horn” robot; each server also has 1.2 TB of cache disk and a Gbit network connection.]
Operating Systems
• Red Hat 6.2 finally closed in August
• Red Hat 7.2 remains in production for BaBar. Will migrate all batch workers to Red Hat 7.3 shortly.
• Red Hat 7.3 service now the main workhorse for LHC experiments.
• Need to start looking at Red Hat 9/10
• Need to deploy Red Hat Advanced Server
Next Procurement
• Based on the experiments’ expected demand profile (as best they can estimate it).
• Exact numbers still being finalised, but approximately:
– 250 dual-processor CPU nodes
– 70 TB available disk
– 100 TB tape
[Chart: CPU Requirements (KSI2K). Stacked demand by experiment (UKQCD, Other, D0, Alice, LHCb, Atlas, CMS, BaBar, GPP-only) on a 0–1000 KSI2K scale, against a 90% capacity line.]
[Chart: GridPP Disk Requirements (TB). Stacked demand by experiment (LCG, Others, UKQCD, D0, Alice, LHCb, Atlas, CMS, BaBar) on a 0–160 TB scale, against a 90% capacity line.]
New Helpdesk
• Need to deploy a new helpdesk (had Remedy). Wanted:
– Web based.
– Free open source.
– Multiple queues and personalities.
• Looked at Bugzilla, OTRS and Request Tracker.
• Finally selected Request Tracker.
• http://helpdesk.gridpp.rl.ac.uk/
• Available for other Tier 2 sites and other GridPP projects if needed.
YUMIT: RPM Monitoring
• Many nodes on the farm; need to make sure RPMs are up to date.
• Wanted a light-weight solution until full fabric-management tools are deployed.
• Package written by Steve Traylen:
– Yum installed on hosts
– Nightly comparison with the YUM database, uploaded to a MySQL server
– Simple web-based display utility in Perl
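The nightly comparison can be sketched in a few lines. Everything here (package names, versions and the function name) is illustrative; the real YUMIT uploads its results to the central MySQL server rather than printing them.

```python
# Illustrative YUMIT-style check: compare RPM versions installed on a host
# against versions available in the YUM repository and report stale packages.
installed = {"openssh": "3.1p1-14", "kernel": "2.4.18-27"}
available = {"openssh": "3.1p1-21", "kernel": "2.4.18-27", "glibc": "2.2.5-44"}

def out_of_date(installed, available):
    """Return packages whose installed version differs from the repo's,
    mapped to (installed_version, available_version)."""
    return {pkg: (ver, available[pkg])
            for pkg, ver in installed.items()
            if pkg in available and available[pkg] != ver}

print(out_of_date(installed, available))
# {'openssh': ('3.1p1-14', '3.1p1-21')}
```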
Exception Monitoring: Nagios
• Already have an exception-handling system (CERN’s SURE coupled with the commercial Automate).
• Looking at alternatives – no firm plans yet, but currently evaluating Nagios: http://www.nagios.org/
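For a sense of what a Nagios deployment involves, monitoring is driven by plain-text object definitions. The fragment below is a hedged illustration only: the host name, address and contact group are hypothetical, and in practice site-wide templates supply most of the directives.

```text
# Illustrative Nagios object definitions (hypothetical host and contacts).
define host{
        host_name               csf-wn001
        alias                   CSF worker node 001
        address                 csf-wn001.example.ac.uk
        max_check_attempts      3
        }

define service{
        host_name               csf-wn001
        service_description     SSH
        check_command           check_ssh
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        notification_interval   30
        notification_period     24x7
        check_period            24x7
        contact_groups          admins
        }
```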
Summary: Outstanding Issues
• Many new developments and new services deployed this year.
• We have to run many distinct services – for example Fermi Linux, RH 6.2/7.2/7.3, EDG testbeds, LCG, CMS DC03, SRB, etc.
• Waiting to hear when the experiments want LCG in volume.
• The Pentium 4 processor is performing poorly.
• Red Hat’s changing support policy is a major concern.