Tier-1
Andrew Sansum
Deployment Board
12 July 2007
Agenda
• Monitoring
• Deployment tools
• Other stuff
Staff Changes
• Lex Holt (Fabric Team) left in June.
Network
• CERN Lightpath
– 10Gb line to CERN working well, but recently suffered a 2-day break
• SuperJanet 5
– 10Gb link to site
– 10Gb LAN on Tier-1
– Share of 2Gb through firewall
– Work underway on a firewall bypass for SRM/SE traffic
Hardware
• Hardware operating well – very stable
• EU tenders underway and expected to deliver in Q4, for:
– >1PB disk
– >2MSI2K CPU
– 1PB tape (framework purchasing)
– Tape drives (tender just beginning)
• Specification of the required technology is very general; we are waiting to see what solutions are offered
Tier-1 LAN
[Network diagram of the RAL site: Router A and the OPN router carry 10Gb/s to CERN and 1Gb/s to SJ4; stacks of 5510/5530 switches (4 x 5530; 2, 3, 5 and 6 x 5510 + 5530) connect the CPU + disk farms, the ADS caches and the Oracle systems over 10Gb/s and N x 1Gb/s links; the RAL Tier-2 is shown alongside the Tier-1.]
CASTOR
• 2.1.2 and previous releases of CASTOR:
– Implemented as a single shared instance
– Very unreliable, with missing functionality
– Unable to cope with various use cases
– Essentially unusable
• How to make things better:
– Improve relationship with CERN to get product improvements
– 1 extra contractor
• 2.1.3 release now deployed:
– Instances planned for ATLAS/CMS/LHCb/others
– Stable
– Being load tested by CMS – promising
dCache
• Still running version 1.7
– Reliability reasonable
• Phase-out had been planned for June/July, but CASTOR is not sufficiently advanced
– Now plan to continue running dCache at least until Christmas
– Will give six months' warning of closure
New Machine Room
• Tender underway, planned completion: August 2008
• 800 m² can accommodate 300 racks + 5 robots
• 2.3MW power/cooling capacity (some UPS)
• Office accommodation for all e-Science staff
• Combined Heat and Power (CHP) generation on site
• Not all for GridPP (but you get most)!
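As a rough sanity check (not a figure from the slides), the quoted 2.3 MW shared evenly across 300 racks implies a per-rack power budget of about 7.7 kW, ignoring the robots' and infrastructure share:

```python
# Back-of-the-envelope estimate, illustrative only: average power per rack
# if the full 2.3 MW capacity were shared by the 300 planned racks.
TOTAL_POWER_W = 2.3e6   # 2.3 MW power/cooling capacity
RACKS = 300             # planned rack capacity (the 5 robots are ignored here)

per_rack_kw = TOTAL_POWER_W / RACKS / 1000
print(f"{per_rack_kw:.1f} kW per rack")  # roughly 7.7 kW per rack
```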
Reliability (Recent issues)
• RB – continue to see:
• Load-related issues
• Database size issues (need frequent cleaning)
– Now running:
• rb01/rb02 as the general RB service
• rb03 dedicated to ALICE and LHCb
– Will add more if necessary, but wish to minimise work on the RB and wait for the WMS
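The VO-to-broker split described above (a general-purpose rb01/rb02 pair plus rb03 dedicated to ALICE and LHCb) can be sketched as a simple lookup with round-robin over the general pair. This is an illustrative model only, not the actual RB configuration; the hostnames are placeholders:

```python
# Illustrative sketch of the described RB layout (hostnames are assumptions,
# not the real Tier-1 machines): dedicated VOs go to rb03, everyone else is
# round-robined across the general rb01/rb02 pair.
import itertools

GENERAL_RBS = itertools.cycle(["rb01.example.ac.uk", "rb02.example.ac.uk"])
DEDICATED_RBS = {
    "alice": "rb03.example.ac.uk",
    "lhcb": "rb03.example.ac.uk",
}

def pick_rb(vo: str) -> str:
    """Return the Resource Broker host a VO's jobs would be steered to."""
    return DEDICATED_RBS.get(vo.lower(), next(GENERAL_RBS))
```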
• Top-level BDII
– 3 servers (March) resolved timeouts for a while, but they recurred recently
– Recent upgrade to the indexing version appears to have helped
• CE
– Experienced an unidentified load problem at the start of June; no recurrence
SL4
• SL4 test service is available with a dedicated CE and a few worker nodes
• Expect to run both SL3 and SL4 concurrently and gradually migrate between the two
– Migration will take place as fast as experiments want
– Capacity will initially be moved at an experiment's request
– Once ATLAS, LHCb and CMS have migrated, we will announce a termination date for the SL3 service
Grid Only
• Long-standing milestone that the Tier-1 was to offer a "Grid Only" service by the end of August 2007
• Recent discussion within the UB concluded that the absence of a reliable CASTOR prevents the Tier-1 from offering a Grid-only service
• The PMB has subsequently said that we should nevertheless move what we can to a Grid-only service (Grid-only job submission, for example)
• A position statement needs to be submitted to the PMB outlining what can be achieved