
Page 1: GridPP Deployment Status GridPP15 Jeremy Coles J.Coles@rl.ac.uk 11th January 2006

GridPP Deployment Status

GridPP15

Jeremy Coles

J.Coles@rl.ac.uk

11th January 2006

Page 2

Overview

1 An update on some of the high-level metrics

2 Some new sources of information

3 General deployment news

4 Expectations for the coming months

5 Preparing for SC4 and our status

6 Summary

Page 3

Prototype metric report

UKI is still contributing well but according to the SFT data our proportion of sites failing is relatively high

Page 4

Snapshot of recent numbers

Region                   # sites   Average CPU
Asia Pacific                   8           450
CERN                          19          4250
Central Europe                15            30
France                         8          1100
Germany & Switzerland         11          3365
Italy                         25          2250
Northern Europe               10          1160
Russia                        14           430
South East Europe             20           370
South West Europe             14           700
UKI                           36          3250

Most of the unavailable sites have been in Ireland as they make the move over to LCG 2.6.0.
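As a rough cross-check of the snapshot, the short sketch below computes the UKI share of sites and of the "Average CPU" column. The figures are copied from the table above; the names `SNAPSHOT` and `regional_share` are illustrative only and are not part of any GridPP or EGEE tool. It assumes the "Average CPU" values can simply be summed across regions.

```python
# Snapshot figures copied from the table: region -> (number of sites, average CPU).
SNAPSHOT = {
    "Asia Pacific": (8, 450),
    "CERN": (19, 4250),
    "Central Europe": (15, 30),
    "France": (8, 1100),
    "Germany & Switzerland": (11, 3365),
    "Italy": (25, 2250),
    "Northern Europe": (10, 1160),
    "Russia": (14, 430),
    "South East Europe": (20, 370),
    "South West Europe": (14, 700),
    "UKI": (36, 3250),
}


def regional_share(region, snapshot=SNAPSHOT):
    """Return (site share, CPU share) of one region as fractions of the totals."""
    total_sites = sum(sites for sites, _ in snapshot.values())
    total_cpu = sum(cpu for _, cpu in snapshot.values())
    sites, cpu = snapshot[region]
    return sites / total_sites, cpu / total_cpu


site_share, cpu_share = regional_share("UKI")
print(f"UKI: {site_share:.1%} of sites, {cpu_share:.1%} of average CPU")
```

On these numbers UKI accounts for 36 of 180 sites and 3250 of 17355 CPUs, which is in line with the ~20% contribution quoted elsewhere in the talk.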

Page 5

Average job slots have increased gradually

[Chart: EGEE total job slots and UK total job slots, 24 June 2004 to 1 December 2005; vertical axis 0 to 20,000 job slots]

UK job slots have increased by about 10% since GridPP14. (See Steve Lloyd’s talk for how this looks against the targets)

Page 6

Therefore our contribution to EGEE CPU resources remains at ~20%

[Chart: UK % of total CPU against date, 24 June 2004 to 4 December 2005; vertical axis: percentage contribution, 0% to 35%]

Page 7

However there still is not consistently high usage of job slots

[Chart: % EGEE slots used and % UK slots used against date, 2 June 2004 to 28 November 2005; vertical axis: % job slots used, 0% to 90%]

Page 8

The largest GridPP users by VO for 2005

[Chart: usage by VO; labelled VOs: LHCb, ATLAS, BABAR, CMS, BIOMED, DZERO, ZEUS]

NB: Excludes data from Cambridge – for Condor support in APEL see Dave Kant’s talk

Page 9

Storage has seen a healthy increase – but usage remains low

At the GridPP Project Management and Deployment Boards yesterday we discussed ways to encourage the experiments to make more use of Tier-2 disk space – the Tier-1 will be unable to meet allocation requests. One of the underlying concerns is what data flags mean to Tier-2 sites.

Page 10

Scheduled downtime

Views of data will be available from CIC portal from today! http://cic.in2p3.fr/

Page 11

Scheduled downtime

Congratulations to Lancaster for being the only site to have no Scheduled Downtime

Page 12

SFT review

It was probably clear already that the majority of our failures (and those of other large ROCs) come from two tests: lcg-rm (failure points: replica catalog, configured BDII, CERN storage for replication, or a local SE problem) and rgma (generally a badly configured site). We know the tests also need to improve and become more reliable and accurate!

Page 13

Overall EGEE statistics

The same problems cover the majority of EGEE resources. Hours of impact will be available soon and will help us evaluate the true significance of these results.

Page 14

Completing the weekly reports

ALL site administrators have now been asked to complete information related to problems observed at their sites as recorded in the weekly operations report

This will impact our message at weekly EGEE operations reviews (http://agenda.cern.ch/displayLevel.php?fid=258) and YOUR Tier-2 performance figures!

Page 15

Performance measures

• The GridPP Oversight Committee has asked us to investigate why some sites perform better than others. As well as looking at the SFT and ticket response data, the Imperial College group will help pull data from their LCG2 Real Time Monitor Daily Summary Reports: http://gridportal.hep.ph.ic.ac.uk/rtm/reports.html

Page 16

EGEE metrics

• While we have defined GridPP metrics many are not automatically produced. EGEE now has metrics as a priority and at EGEE3 a number of metrics were agreed for the project and assigned.

SIZE
• # of sites in production
• # of job slots
• Total available kSpecInt
• Storage (disc)
• Mass storage
• # EGAP approved VOs
• # active VOs
• # active users
• Total % used resources

DEPLOYMENT
• Speed of m/w security update

OPERATIONS
• Site responsiveness to COD
• Site response to tickets
• Site tests failed
• % availability of SE, CE
• # days downtime per ROC

USAGE
• Jobs per VO (submit, comp, fail)
• Data transfer per VO
• CPU and storage usage per VO
• % sites blacklisted/whitelisted
• # CE/SE available to VO

SUPPORT
• User ticket response time
• Number of “supporters”
• # tickets escalated
• % tickets wrongly assigned
• ROC measures

SERVICE NODES (testing)
• RB – submit to CE time
• BDII – query time average
• MyProxy – register/access/del
• SRM-SE – test file movement
• Catalogue test
• VOMS
• RGMA

Page 17

The status of 2.7.0

• Mon 9th Jan - tag and begin local testing of installations and upgrades on mini testbeds, complete documentation
• Mon 16th Jan - pre-release to >3 ROCs for a week of further testing
• Mon 23rd Jan - incorporate results of ROC testing and release asap
• Release 2.7.0 (at the end of January!?)

Expect
• Bug fixes – RB, BDII, CE, SE-classic, R-GMA, GFAL, LFC, SE_DPM
• VOMS – new server client version
• VO-BOX – various functionality changes
• LFC/DPM updates
• Lcg_utils/GFAL – new version & bug fixes
• RB – new functionality for job status checking
• Security issues – pool account recycling, signed rpm distribution
• FTS clients & dCache 1.6.6.2
• Some “VO management via YAIM” additions

Details of release work: https://uimon.cern.ch/twiki/bin/view/LCG/LCG-2_7_0

Page 18

Outcomes of security challenge

Comments (Thanks to Alessandra Forti)
• Test suites should be asynchronous
• Security contacts mail list is not up to date
• 4 sites' CSIRTs did not pass on information – site security contacts should be the administrators and not site CSIRTs
• 1 site did not understand what to do
• Majority of sites acknowledged tickets within a few hours once the site administrator received the ticket
• On average sites responded with CE data in less than 2 days (some admins were unsure about contacting the RB staff)
• 2 sites do not use lcgpbs jobmanager and were unable to find the information in the log files (also 1 using Condor)
• Some sites received more than one SSC job in a 3 hr timeframe and were unable to return an exact answer but gave several
• Mistake in date – admins spotted inconsistencies
• ROC struggled with ticket management and caused delays in processing tickets!

Aside: The EGEE proposed Security Incident handbook is being reviewed by the deployment team: http://wiki.gridpp.ac.uk/wiki/Incident_Response_Handbook

Page 19

Other areas of interest!

• The Footprints version (UKI ROC ticketing system) will be upgraded on 23rd January. This will improve our interoperations with GGUS and other ROCs (using xml emails). There should be little observable impact but we do ask PLEASE SOLVE & CLOSE as many currently open tickets as possible by 23rd January.

• Culham (the place which hosted the last operations workshop) will be adding a new UKI site in the near future. They will join or host the Fusion VO.

• Most sites have now completed the “10 Easy Network Questions” responses: http://wiki.gridpp.ac.uk/wiki/GridPP_Answers_to_10_Easy_Network_Questions This has proved a useful exercise. What do you think?

• The deployment team has identified a number of operational areas to improve. These include such things as experiment software installation, VO support, and availability of certain information on processes (like where to start for new sites).

• Pre-production service: UKI now has 3 sites with gLite (components) either deployed or in the process of being deployed.

Page 20

Other areas of interest!

REMINDER & REQUEST – Please enable more VOs! GridPP PMB requests that 0.5% (1% in EGEE-2) resources be used to support wider VOs – like BioMed. This will also get our utilisation higher. Feedback is going to developers on making adding VOs easier.

Page 21

Our focus is now on Service Challenge 4

GridPP links, progress and status is being logged in the GridPP wiki: http://wiki.gridpp.ac.uk/wiki/Service_Challenges

SRM
• 80% of sites have working (file transfers with 2 other sites successful) SRM by end of December
• All sites have working SRM by end of January
• 40% of sites (using FTS) able to transfer files using an SRM 2.1 API by end February
• All sites (using FTS) able to transfer files using an SRM 2.1 API by end March
• Interoperability tests between SRM versions at Tier-1 and Tier-2s (TBC)

FTS Channels
• FTS channel to be created for all T1-T2 connections by end of January
• FTS client configured for 40% sites by end January
• FTS channels created for one Intra-Tier-2 test for each Tier-2 by end of January
• FTS client configured for all sites by end March

A number of milestones (previously discussed at the 15th November UKI Monthly Operations Meeting) have been set. Red in text means milestone at risk (generally due to external dependencies) and Green text signifies done.

Page 22

Core to these are the…

Data Transfers

Tier-1 to Tier-2 Transfers (Target rate 300-500Mb/s)
• Sustained transfer of 1TB data to 20% sites by end December
• Sustained transfer of 1TB data from 20% sites by end December
• Sustained transfer of 1TB data to 50% sites by end January
• Sustained transfer of 1TB data from 50% sites by end January
• Sustained individual transfers (>1TB continuous) to all sites completed by mid-March
• Sustained individual transfers (>1TB continuous) from all sites by mid-March
• Peak rate tests undertaken for all sites by end March
• Aggregate Tier-2 to Tier-1 tests completed at target rate (rate TBC) by end March

Inter Tier-2 Transfers (Target rate 100 Mb/s)
• Sustained transfer of 1TB data between largest site in each Tier-2 to that of another Tier-2 by end February
• Peak rate tests undertaken for 50% sites in each Tier-2 by end February
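To give a feel for what the target rates mean in practice, the sketch below estimates how long a sustained 1TB transfer would take at each quoted rate. It assumes "Mb/s" means megabits per second and that 1TB is a decimal terabyte (10^12 bytes); the slides do not state either, and the helper name `transfer_hours` is purely illustrative.

```python
ONE_TB = 10**12  # assumption: decimal terabyte, in bytes


def transfer_hours(data_bytes, rate_megabits_per_s):
    """Hours needed to move data_bytes at a sustained rate given in megabits per second."""
    bits = data_bytes * 8
    seconds = bits / (rate_megabits_per_s * 1_000_000)
    return seconds / 3600


# Target rates quoted on the slide (Mb/s).
TARGETS = [
    ("Tier-1 to Tier-2, lower target", 300),
    ("Tier-1 to Tier-2, upper target", 500),
    ("Inter Tier-2 target", 100),
]

for label, rate in TARGETS:
    print(f"{label}: {rate} Mb/s -> {transfer_hours(ONE_TB, rate):.1f} h per TB")
```

Under these assumptions a 1TB transfer takes roughly 4.4 to 7.4 hours at the Tier-1 to Tier-2 target rates and about 22 hours at the inter Tier-2 target rate, so each sustained test occupies a site for a good part of a day.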

Page 23

The current status

Sites in the transfer matrix (row and column labels): RAL Tier-1, Lancaster, Manchester, Edinburgh, Glasgow, IC-HEP, RAL-PPD

Axis label: Receiving

Rates listed in each row:
RAL Tier-1: ~800 Mb/s, 350 Mb/s, 156 Mb/s, 84 Mb/s, 309 Mb/s, 397 Mb/s
Lancaster: 0 Mb/s
Manchester: none listed
Edinburgh: 422 Mb/s, 210 Mb/s, 224 Mb/s
Glasgow: 331 Mb/s, 122 Mb/s
IC-HEP: none listed
RAL-PPD: none listed

NEXT SITES:
London – RHUL & QMUL
ScotGrid – Durham
SouthGrid – Birmingham & Oxford?
NorthGrid – Sheffield? & Liverpool

KEY:
Black figures indicate 1TB transfer
Blue figures indicate <1TB transfer (eg. 10 GB)

http://wiki.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Tests

Page 24

Additional milestones

LCG File Catalog
• LFC document available by end November
• LFC installed at 1 site in each Tier-2 by end December
• LFC installed at 50% sites by end January
• LFC installed at all sites by end February
• Database update tests (TBC)

VO Boxes
• Depending on experiment responses to security and operations questionnaire and GridPP position on VO Boxes.
• VOBs available (for agreed VOs only) for 1 site in each Tier-2 by mid-January
• VOBs available for 50% sites by mid-February
• VOBs available for all (participating) sites by end March

Experiment Specific Tests (TBC)
• To be developed in conjunction with experiment plans – Please make suggestions!

Page 25

Additional milestones

LHCb & ALICE questionnaires received. Accepted and VO boxes deployed at Tier-1. Little use so far – ALICE has not had a disk allocation. ATLAS original response was not accepted. They have since tried to implement VO boxes and found problems so are now looking at a centralised model. CMS do not have VO Boxes but they DO require local VO persistent processes.

Page 26

Getting informed & involved!

The deployment team are working to make sure sites have sufficient information. Coordinate your activities with your Tier-2 Coordinator.

1) Stay up to date via the Storage Group work: http://wiki.gridpp.ac.uk/wiki/Grid_Storage

2) General Tier-1 support: http://wiki.gridpp.ac.uk/wiki/RAL_Tier1

3) Understand and setup FTS (channels): http://wiki.gridpp.ac.uk/wiki/RAL_Tier1_File_Transfer_Service

4) VO Boxes go via Tier-1 first: http://wiki.gridpp.ac.uk/wiki/VOBox

5) Catalogues (& data management): http://wiki.gridpp.ac.uk/wiki/Data_Management

The status of sites is being tracked here: http://wiki.gridpp.ac.uk/wiki/Service_Challenge_4_Site_Status

Some particular references worth checking out when taking the next step:

6) What RALPP did to get involved: http://wiki.gridpp.ac.uk/wiki/RALPP_Local_SC4_Preparations

8) Edinburgh dCache tests: http://wiki.gridpp.ac.uk/wiki/Ed_SC4_Dcache_Tests

9) Glasgow DPM testing: http://wiki.gridpp.ac.uk/wiki/Glasgow

PLEASE CREATE SITE TESTING LOGS – it helps with debugging and information sharing

Page 27

Summary

1 Metrics show stability and areas where we can improve

2 EGEE work will add to information which is published & analysed

3 GridPP & experiments need to work at better use of Tier-2 disk

4 There are changes coming with 2.7.0 & helpdesk upgrade

5 Focus has shifted to Service Challenge work (including security)

6 Sites asked to complete reports, reduce tickets & get involved in SC4!