Tier-1 Status
Andrew Sansum
GRIDPP18, 21 March 2007
Staff Changes
• Steve Traylen left in September
• Three new Tier-1 staff:
– Lex Holt (Fabric Team)
– James Thorne (Fabric Team)
– James Adams (Fabric Team)
• One EGEE funded post to operate a PPS (and work on integration with NGS):
– Marian Klein
Team Organisation
Grid Services
– Grid/Support: Ross, Condurache, Hodges, Klein (PPS), vacancy
Fabric (H/W and OS)
– Bly (team leader), Wheeler, Holt, Thorne, White (OS support), Adams (HW support)
CASTOR SW/Robot
– Corney (GL), Strong (Service Manager), Folkes (HW Manager), de Witt, Jensen, Kruk, Ketley, Bonnet; 2.5 FTE effort
Machine Room Operations
Networking Support
Database Support (Brown)
Project Management (Sansum/Gordon/(Kelsey))
Hardware Deployment - CPU
• 64 dual-core/dual-CPU Intel Woodcrest 5130 systems delivered in November (about 550 KSI2K)
• Completed acceptance tests over Christmas; entered production mid-January
• CPU farm capacity now (approximately):
– 600 systems
– 1250 cores
– 1500 KSI2K
Hardware Deployment - Disk
• 2006 was a difficult year with deployment hold-ups:
– March 2006 delivery: 21 servers, Areca RAID controller, 24×400GB WD (RE2) drives. Available: January 2007
– November 2006 delivery: 47 servers, 3Ware RAID controller, 16×500GB WD (RE2). Accepted February 2007 (but still deploying to CASTOR)
– January 2007 delivery: 39 servers, 3Ware RAID controller, 16×500GB WD (RE2). Accepted March 2007. Ready to deploy to CASTOR
Disk Deployment - Issues
• March 2006 (Clustervision) delivery:
– Originally delivered with 400GB WD400YR drives
– Many drive ejects under normal load test (drives had worked OK when we tested in January)
– Drive specification found to have changed: compatibility problems with the RAID controller (despite the drive being listed as compatible)
– Various firmware fixes tried; improvements, but not fixed
– August 2006: WD offered to replace with 500YS drives
– September 2006: load tests of the new configuration began to show occasional (but unacceptably frequent) drive ejects (a different problem)
– Major diagnostic effort by Western Digital; Clustervision also tried various fixes. Lots of theories: vibration, EM noise, protocol incompatibility (progress slow as the failure rate was quite low)
– Fault hard to trace; various theories and fixes tried, but eventually traced (early December) to faulty firmware
– Firmware updated; load test showed the problem fixed (mid-December). Load test completed in early January and deployment began
Disk Deployment - Cause
• Western Digital working at 2 sites, with logic analysers on the SATA interconnect.
• Eventually the fault was traced to a "missing return" in the firmware:
– If the drive head stays too long in one place, it repositions to allow lubricant to migrate
– Only shows up under certain work patterns
– No return follows the reposition, and 8 seconds later the controller ejects the drive
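The failure mode above can be sketched as a toy model: a reposition path that never reports completion, so the controller's timeout expires and the drive is ejected. All names and thresholds are illustrative, not the actual firmware logic.

```python
# Toy model of the "missing return" firmware bug: after a lubricant-migration
# head reposition, the buggy firmware never sends a completion, so the RAID
# controller's 8-second timer expires and it ejects the drive.

CONTROLLER_TIMEOUT = 8    # seconds the controller waits for a response
IDLE_LIMIT = 5            # "head in one place too long" threshold (arbitrary)

def drive_response_time(head_idle, firmware_fixed):
    """Seconds until the drive responds, or None if it never does."""
    if head_idle > IDLE_LIMIT:
        # Drive repositions the head so lubricant can migrate.
        if not firmware_fixed:
            return None   # the bug: no return/completion after repositioning
    return 1              # normal case: respond promptly

def drive_ejected(head_idle, firmware_fixed):
    """The controller ejects the drive if no reply arrives within the timeout."""
    t = drive_response_time(head_idle, firmware_fixed)
    return t is None or t > CONTROLLER_TIMEOUT
```

This also shows why the fault only appeared under certain work patterns: the buggy path is reached only when the head has been idle long enough to trigger a reposition.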
Disk Deployment
            #Servers   Capacity (TB)
2006            57         179
Jan 2007        21         190
Feb 2007        47         238
March 2007      39         197
Total          138         750
Hardware Deployment - Tape
• SL8500 tape robot upgraded to 10000 slots in August 2006.
• GRIDPP bought 3 additional T10K tape drives in February 2007 (now 6 drives owned by GRIDPP)
• Further purchase of 350TB of tape media in February 2007.
• Total tape capacity now 850-900TB (but not all immediately allocated: some to assist the CASTOR migration, some needed for CASTOR operations).
Hardware Deployment - Network
• 10Gb line from CERN available in August 2006
• RAL was scheduled to attach to the Thames Valley Network (TVN) at 10Gb by November 2006:
– Change of plan in November: I/O rates from the Tier-1 were already visible to UKERNA. Decided to connect the Tier-1 by a 10Gb resilient connection direct into the SJ5 core (planned mid Q1)
– Connection delayed but now scheduled for end of March
• GRIDPP load tests identified several issues at the RAL firewall. These were resolved, but the plan is now to bypass the firewall for SRM traffic from SJ5.
• A number of internal Tier-1 topology changes while we have enhanced the LAN backbone to 10Gb in preparation for SJ5
Tier-1 LAN

[Diagram: Tier-1 LAN topology. Router A connects the RAL site to SJ4 at 1Gb/s; the OPN router carries the 10Gb/s link to CERN. Stacks of 5510/5530 switches, with 10Gb/s uplinks and N×1Gb/s connections, serve the CPU + disk racks, ADS caches, Oracle systems and the RAL Tier-2.]
New Machine Room
• Tender underway, planned completion: August 2008
• 800 m² can accommodate 300 racks + 5 robots
• 2.3MW power/cooling capacity (some UPS)
• Office accommodation for all e-Science staff
• Combined Heat and Power (CHP) generation on site
• Not all for GRIDPP (but you get most)!
Tier-1 Capacity delivered to WLCG (2006)

[Pie chart: BNL 18%, RAL 17%, INFN-T1 15%, CERN 11%, SARA/NIKHEF 10%, IN2P3 8%, FZK 6%, FNAL 5%, PIC 5%, Asia Pacific 4%, Others 1%]
Last 12 months CPU Occupancy
[Chart annotations: +260 KSI2K added May 2006; +550 KSI2K added January 2007]
Recent CPU Occupancy (4 weeks)
Air-conditioning work (300 KSI2K offline)
CPU Efficiencies
CPU Efficiencies
CMS merge jobs – hang on CASTOR
ATLAS/LHCb jobs hanging on dCache
Babar jobs running slow – reason unknown
3D Service
• Used by ATLAS and LHCb to distribute conditions data via Oracle streams
• RAL was one of five sites that deployed a production service during Phase I.
• Small SAN cluster: 4 nodes, 1 Fibre Channel RAID array.
• RAL takes a leading role in the project.
Reliability
• Reliability matters to the experiments:
– Use SAM monitoring to identify priority areas
– Also worried about job loss rates
• Priority at RAL is to improve reliability:
– Fix the faults that degrade our SAM availability
– New exception monitoring and automation system based on Nagios
• Reliability is improving, but the work feels like an endless treadmill: fix one fault and find a new one.
Reliability - CE
• Split the PBS server and CE some time ago
• Split the CE and local BDII
• Site BDII timed out on the CE info provider:
– CPU usage very high on the CE; the info provider was "starved"
– Upgraded the CE to 2 cores
• Site BDII still timed out on the CE info provider:
– CE system disk was I/O bound
– Reduced load (changed backups etc.)
– Finally replaced the system drive with a faster model
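The timeout chain on this slide can be modelled simply: the site BDII waits a fixed time for the CE info provider, so anything that slows the provider (CPU starvation, an I/O-bound system disk) makes the CE vanish from the information system. A minimal sketch; the timeout value and function names are illustrative, not the actual BDII implementation.

```python
# Toy model: the CE stays published only if its info provider finishes
# within the site BDII's timeout; host load stretches the provider runtime.

BDII_TIMEOUT = 30  # seconds the site BDII waits for the info provider (illustrative)

def provider_runtime(base_runtime, load_factor):
    """Info-provider runtime grows with host contention (CPU or disk I/O)."""
    return base_runtime * load_factor

def ce_visible(base_runtime, load_factor):
    """True if the provider replies before the site BDII gives up."""
    return provider_runtime(base_runtime, load_factor) <= BDII_TIMEOUT
```

For example, a provider that normally takes 10 seconds still makes the cut on an idle host, but at 5× load it takes 50 seconds and the CE drops out of the BDII, which is why adding cores and a faster system disk fixed the symptom.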
CE Load
Job Scheduling
• SAM jobs failing to be scheduled by MAUI:
– SAM tests run under the operations VO but share a gid with dteam; dteam had used all its resource, so MAUI started no more jobs
– Changed scheduling to favour the ops VO (long-term plan is to split ops and dteam)
• PBS server hanging after communications problems:
– A job stuck in the pending state jammed the whole batch system (no jobs start; site unavailable!)
– Now auto-detect pending jobs in this state and hold them; remaining jobs start and availability is good
– But held jobs affect the ETT and we receive less work from the RB, so we have to delete held jobs
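The shared-gid starvation above can be sketched as a toy model; the quota mechanism, group names and limits are illustrative, not real MAUI configuration.

```python
# Toy model: a per-gid limit sees ops and dteam as one consumer while they
# share a gid, so dteam filling the quota blocks ops SAM tests as well.

def can_start(job_gid, running_by_gid, quota_by_gid):
    """A job may start only while its gid is below that gid's quota."""
    return running_by_gid.get(job_gid, 0) < quota_by_gid.get(job_gid, 0)

# Shared gid: dteam has exhausted it, so an ops SAM test mapped to the
# same gid cannot start.
shared_quota = {"dteam": 10}

# After splitting the groups, ops has its own headroom even when dteam
# is full -- the long-term plan mentioned above.
split_quota = {"dteam": 10, "ops": 2}
```

With `shared_quota`, `can_start("dteam", {"dteam": 10}, shared_quota)` is false for everyone in the group; with `split_quota`, an ops job starts regardless of dteam's usage.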
Jobs de-queued at CE
• Jobs reach the CE and are successfully submitted to the scheduler, but shortly afterwards the CE decides to de-queue the job:
– Only impacts SAM monitoring occasionally
– May be impacting users more than SAM, but we cannot tell from our logs
– Logged a GGUS ticket but no resolution
RB
• RB running very busy for extended periods during the summer:
– Second RB (rb02) added early November, but there is no transparent way of advertising it; UIs need to be configured manually (see GRIDPP wiki).
• Jobs found to abort on rb01, linked to the size of its database:
– Database needed cleaning (was over 8GB)
• Job cancels may (but not reproducibly) break the RB (it may go 100% CPU bound); no fix to this ticket.
RB Load
[Chart annotations: rb02 deployed; drained to fix hardware; rb02 high CPU load]
Top Level BDII
• Top-level BDII not reliably responding to queries:
– Query rate too high
– UK sites failing SAM tests for extended periods
• Upgraded the BDII to two servers on DNS round robin:
– Sites still occasionally fail SAM tests
• Upgraded the BDII to 3 servers (last Friday):
– Hope the problem is fixed; please report timeouts.
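The effect of these upgrades can be sketched as follows; hostnames and query counts are illustrative, and real DNS round robin is only approximately even because resolvers cache answers.

```python
# Toy model of DNS round robin in front of the top-level BDII: queries are
# spread across the advertised servers, so per-server load falls as servers
# are added (the idealised, perfectly even case).

from collections import Counter
from itertools import cycle

def distribute(queries, servers):
    """Assign queries to servers the way an ideal round robin would."""
    rr = cycle(servers)
    return Counter(next(rr) for _ in range(queries))

two = distribute(600, ["bdii01", "bdii02"])              # 300 queries each
three = distribute(600, ["bdii01", "bdii02", "bdii03"])  # 200 queries each
```

Going from two servers to three cuts each server's share of the query load by a third, which is the headroom the final upgrade was hoping to buy.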
FTS
• Reasonably reliable service:
– Based on a single server
– Monitoring and automation to watch for problems
• At the next upgrade (soon), move from a single server to two pairs:
– One pair will handle the transfer agents
– One pair will handle the web front end
dCache
• Problems with gridftp doors hanging:
– Partly helped by changes to network tuning
– But still impacts SAM tests (and experiments). Decided to move the SAM CE replica-manager test from dCache to CASTOR (a cynical manoeuvre to help SAM)
• Had hoped this month's upgrade to version 1.7 would resolve the problem:
– It didn't help
– Have now upgraded all gridftp doors to Java 1.5; no hangs since the upgrade last Thursday.
SAM Availability
[Chart: RAL-LCG2 availability/reliability by month, May 2006 to February 2007, on a 0-100% scale, with Available, Old Reliability, New Reliability, Target, Average and Best 8 series]
CASTOR
• Autumn/Winter 2005:
– Decided to migrate the tape service to CASTOR
– Decided that CASTOR will eventually replace dCache for disk pool management; CASTOR2 deployment starts
• Spring/Summer 2006: major effort to deploy and understand CASTOR
– Difficult to establish a stable pre-production service
– Upgrades extremely difficult to make work; test service down for weeks at a time following an upgrade or patching
• September 2006:
– Originally planned to have a full production service
– Eventually, after heroic effort, the CASTOR team established a pre-production service for CSA06
• October 2006:
– But we had no disk of our own and had to borrow it – BIG THANK YOU PPD!
– CASTOR performed well in CSA06
• November/December 2006: worked on a CASTOR upgrade but eventually failed to upgrade
• January 2007: declared the CASTOR service production quality
• Feb/March 2007:
– Continuing work with CMS as they expand the range of tasks expected of CASTOR; significant load-related operational issues identified (e.g. CMS merge jobs cause LSF meltdown)
– Started work with ATLAS, LHCb and MINOS to migrate to CASTOR
CASTOR Layout
[Diagram: SRM 1 endpoints (ralsrma–ralsrmf) mapped onto disk pools and service classes: D1T0 (cmswanout, lhcbD1T0, atlasD1T0prod, atlasD1T0usr, atlasD1T0test), D0T1 (prd, tmp, CMSwanin, atlasD0T1test), plus cmsFarmRead and atlasD1T1]
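The DnTm names above follow the WLCG storage-class convention: n disk copies and m tape copies, so D1T0 is disk-only and D0T1 is tape-backed with disk as a transient buffer. A minimal parser sketch for the prefix notation (the function name is illustrative):

```python
# Decode the WLCG DnTm storage-class prefix used by the service classes
# above: Dn = disk copies, Tm = tape copies.

import re

def parse_storage_class(name):
    """Return (disk_copies, tape_copies) from a DnTm prefix, else None."""
    m = re.match(r"D(\d)T(\d)", name)
    return (int(m.group(1)), int(m.group(2))) if m else None
```

So `parse_storage_class("D1T0")` yields one disk copy and no tape copy, while experiment-named pools like `cmswanout` carry the class in the diagram rather than in their own name.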
CMS
PhEDEx Rate to CASTOR (RAL Destination)
PhEDEx Rate to CASTOR (RAL Source)
SL4 and gLite
• Preparing to migrate some batch workers to SL4 for experiment testing.
• Some gLite testing (on SL3) is already underway, but we are becoming increasingly nervous about the risks associated with late deployment of the forthcoming SL4 gLite release
Grid Only
• Long-standing milestone that the Tier-1 will offer a "Grid Only" service by the end of August 2007.
• Discussed at the January UB; considerable discussion about what "Grid Only" means.
• Basic target confirmed by the Tier-1 board, but details still to be fixed as to exactly what remains needed.
Conclusions
• Last year was a tough year, but we eventually made good progress:
– A lot of problems encountered
– A lot accomplished
• This year the focus will be on:
– Establishing a stable CASTOR service that meets the needs of the experiments
– Deploying the required releases of SL4/gLite
– Meeting (hopefully exceeding) availability targets
– Hardware ramp-up as we move towards GRIDPP3