Tier-1
Andrew Sansum
Deployment Board
12 July 2007
Agenda
• Monitoring
• Deployment tools
• Other stuff
Staff Changes
• Lex Holt (Fabric Team) left in June.
Network
• CERN Lightpath
– 10Gb line to CERN working well, but recently suffered a 2-day break
• SuperJanet 5
– 10Gb link to site
– 10Gb LAN on Tier-1
– Share of 2Gb through firewall
– Work underway on a firewall bypass for SRM/SE traffic
Hardware
• Hardware operating well – very stable
• EU tenders underway and expected to deliver in Q4, for:
– >1PB disk
– >2MSI2K CPU
– 1PB tape (framework purchasing)
– Tape drives (tender just beginning)
• Specification of the required technology is very general; we are waiting to see what solutions are offered
Tier-1 LAN
[Network diagram of the RAL site: Router A and the OPN router carry 10Gb/s to CERN and 1Gb/s to SJ4; stacks of 5510/5530 switches (4 x 5530; 2, 3, 5 and 6 x 5510 + 5530) connect the CPU + disk farms, the ADS caches and the Oracle systems over 10Gb/s and N x 1Gb/s links; the RAL Tier-2 is shown alongside the Tier-1.]
CASTOR
• 2.1.2 and previous releases of CASTOR:
– Implemented as a single shared instance
– Very unreliable, with missing functionality
– Unable to cope with various use cases
– Essentially unusable
• How to make things better:
– Improve relationship with CERN to get product improvements
– 1 extra contractor
• 2.1.3 release now deployed:
– Instances planned for ATLAS/CMS/LHCb/others
– Stable
– Being load tested by CMS – promising
dCache
• Still running version 1.7
– Reliability reasonable
• Phase-out had been planned for June/July, but CASTOR is not sufficiently advanced
– Now plan to continue running dCache at least until Christmas
– Will give six months' warning of closure
New Machine Room
• Tender underway, planned completion: August 2008
• 800 m² can accommodate 300 racks + 5 robots
• 2.3MW power/cooling capacity (some UPS)
• Office accommodation for all e-Science staff
• Combined Heat and Power (CHP) generation on site
• Not all for GridPP (but you get most)!
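As a rough sanity check (not a figure from the slides), the quoted 2.3 MW shared evenly across 300 racks implies a per-rack power budget of about 7.7 kW, ignoring the robots' and infrastructure share:

```python
# Back-of-the-envelope estimate, illustrative only: average power per rack
# if the full 2.3 MW capacity were shared by the 300 planned racks.
TOTAL_POWER_W = 2.3e6   # 2.3 MW power/cooling capacity
RACKS = 300             # planned rack capacity (the 5 robots are ignored here)

per_rack_kw = TOTAL_POWER_W / RACKS / 1000
print(f"{per_rack_kw:.1f} kW per rack")  # roughly 7.7 kW per rack
```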
Reliability (Recent issues)
• RB – continue to see:
• Load-related issues
• Database size issues (need frequent cleaning)
– Now running:
• rb01/rb02 as the general RB service
• rb03 dedicated to ALICE and LHCb
– Will add more if necessary, but wish to minimise work on the RB and wait for the WMS
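The VO-to-broker split described above (a general-purpose rb01/rb02 pair plus rb03 dedicated to ALICE and LHCb) can be sketched as a simple lookup with round-robin over the general pair. This is an illustrative model only, not the actual RB configuration; the hostnames are placeholders:

```python
# Illustrative sketch of the described RB layout (hostnames are assumptions,
# not the real Tier-1 machines): dedicated VOs go to rb03, everyone else is
# round-robined across the general rb01/rb02 pair.
import itertools

GENERAL_RBS = itertools.cycle(["rb01.example.ac.uk", "rb02.example.ac.uk"])
DEDICATED_RBS = {
    "alice": "rb03.example.ac.uk",
    "lhcb": "rb03.example.ac.uk",
}

def pick_rb(vo: str) -> str:
    """Return the Resource Broker host a VO's jobs would be steered to."""
    return DEDICATED_RBS.get(vo.lower(), next(GENERAL_RBS))
```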
• Top-level BDII
– 3 servers (March) resolved timeouts for a while, but they recurred recently
– Recent upgrade to the indexing version appears to have helped
• CE
– Experienced an unidentified load problem at the start of June; no recurrence
SL4
• SL4 test service is available with a dedicated CE and a few worker nodes
• Expect to run both SL3 and SL4 concurrently and gradually migrate between the two
– Migration will take place as fast as experiments want
– Capacity will initially be moved at an experiment's request
– Once ATLAS, LHCb and CMS have migrated, we will announce a termination date for the SL3 service
Grid Only
• Long-standing milestone that the Tier-1 was to offer a "Grid Only" service by the end of August 2007
• Recent discussion within the UB concluded that the absence of a reliable CASTOR prevents the Tier-1 from offering a Grid-only service
• The PMB has subsequently said that we should nevertheless move what we can to a Grid-only service (Grid-only job submission, for example)
• A position statement needs to be submitted to the PMB outlining what can be achieved