GridPP Status Report David Britton, 15/Sep/09




Introduction

Since the last Oversight:

• The UK has continued to be a major contributor to WLCG.

• A focus on resilience and disaster management (GridPP22)

• The UK infrastructure has been validated by STEP09.

• Moved the Tier-1 to R89.

• Procured significant new hardware.

• Adapted to developments in the LHC schedule; the EGI+ proposals; and the UK funding constraints.

Issues from the last Oversight:

• “Other Experiments.”

• EGI/NGI/NGS etc.

• CASTOR.

• OPN network.

To be covered by Project Manager:

• Project Milestones/Deliverables.

• Project Risks.

• Project Finances.


WLCG: Largest scientific Grid in the world

September 2009:

>315,000 KSI2K

Worldwide: 288 sites in 55 countries – 190,000 CPUs

In the UKI: 22 sites and about 19,000 CPUs


UK CPU Contribution

Same picture if non-LHC VOs included


8/Sep/09

UK Site Contributions

Region        2007   2008   2009
NorthGrid      34%    22%    15%
London         28%    25%    32%
ScotGrid       18%    17%    22%
Tier-1         13%    15%    13%
SouthGrid       7%    16%    13%
GridIreland     0%     6%     5%

All areas of the UK make valuable contributions

“Other VOs” used 16% of the CPU time this year.


UK Site Contributions: Non-LHC VOs


All regions supported the “Other VOs”.

Top-12 “Other VOs” include many disciplines


Tier-2 Resources

1/Apr/09

The Tier-2s have delivered (Brunel currently installing 600TB of disk)

Accounting error: 230TB delivered.


Tier-2 Performance

Resource-weighted averages

8/Sep/09

The Tier-2s have improved and are performing well.
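As a reminder of what a resource-weighted average means here, the sketch below weights each site's performance figure by the size of that site. The site names, figures and weights are hypothetical placeholders, not values from the plot, and the function name is illustrative only.

```python
def resource_weighted_average(values, weights):
    """Average of per-site values, each weighted by that site's resources."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical sites: (performance fraction, resource weight in arbitrary units).
sites = {"site_a": (0.90, 100), "site_b": (0.98, 300)}

values = [perf for perf, _ in sites.values()]
weights = [size for _, size in sites.values()]

weighted = resource_weighted_average(values, weights)
simple = sum(values) / len(values)

print(f"simple mean:            {simple:.3f}")
print(f"resource-weighted mean: {weighted:.3f}")
```

With weighting, a large site counts for more than a small one, so the result reflects the share of the resources that is performing well rather than the share of sites.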


Service Resilience

GridPP23 Agenda


A sustained push was made on improving service resilience at all levels.

Many improvements were made at many sites and, ultimately, STEP09 demonstrated that the UK Grid was ready for data (see later slide).

Disaster management processes were developed and are regularly engaged (see later slide).


STEP09 UK Highlights

– RAL was the best ATLAS Tier-1 after the BNL ATLAS-only Tier-1

– Glasgow ran more jobs than any of the 50-60 ATLAS Tier-2 sites throughout the world.

– Tier-2 sites made good contributions and were tuning (not fire-fighting) during STEP09 and subsequent testing.

– Quote: “The responsiveness of RAL to CMS during STEP09 was in stark-contrast to many other Tier-1s.”

– CMS noted the tape performance at RAL was very good as was the CPU efficiency (CASTOR 2.1.7 worked well).

– Many (if not all) of the metrics for the experiments were met and, in some cases, significantly exceeded at RAL during STEP09.


STEP09: RAL Operations Overview

• Generally very smooth operation:

– Most service systems relatively unloaded, with plenty of spare capacity.

– Calm atmosphere.

  • Daytime “production team” monitored the service.

  • Only one callout.

  • Most of the team even took two days out off site for a department meeting!

– Very good liaison with VOs and a good idea of what was going on.

  • In regular informal contact with UK representatives.

– Some problems with CASTOR tape migration (3 days) on the ATLAS instance, but all handled satisfactorily and fixed. Did not visibly impact experiments.

  • Robot broke down for several hours (a stuck handbot led to all drives being de-configured in CASTOR). Caught up quickly.

• Very useful exercise: learned a lot, but very reassuring.

– More at: http://www.gridpp.rl.ac.uk/blog/category/step09/


STEP09: RAL Batch Service

• Farm typically running > 2000 jobs. By 9th June at equilibrium: (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%)

• Problem 1: ATLAS job submission exceeded 32K files on CE.

– See hole on 9th. We thought ATLAS had paused, so it took time to spot.

• Problem 2: Fair shares not honoured as aggressive ALICE submission beat ATLAS to job starts.

– Need more ATLAS jobs in queue faster. Manually cap ALICE. Fixed by 9th June. See decrease in (red) ALICE work.

• Problem 3: Occupancy initially poor (initially 90%). Short on memory (2GB/core but ATLAS jobs needed 3GB vmem). Gradually increased MAUI over-commit on memory to 50%. Occupancy --> 98%.
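The arithmetic behind Problem 3 can be sketched as follows. This assumes that an over-commit of X% means the scheduler treats each core as having (1 + X/100) times its physical memory; that reading, the function name and the intermediate 25% step are assumptions for illustration, not actual MAUI configuration.

```python
def schedulable_memory_gb(physical_gb: float, overcommit_fraction: float) -> float:
    """Memory per core the scheduler is willing to hand out.

    Assumption: over-commit simply scales the physical memory per core.
    """
    return physical_gb * (1.0 + overcommit_fraction)


PHYSICAL_GB_PER_CORE = 2.0   # from the slide: 2GB/core
ATLAS_VMEM_GB = 3.0          # from the slide: ATLAS jobs needed 3GB vmem

# Step the over-commit up gradually; 0.25 is an illustrative intermediate value.
for overcommit in (0.0, 0.25, 0.5):
    limit = schedulable_memory_gb(PHYSICAL_GB_PER_CORE, overcommit)
    fits = limit >= ATLAS_VMEM_GB
    print(f"over-commit {overcommit:>4.0%}: {limit:.2f} GB/core, "
          f"ATLAS job fits one core: {fits}")
```

Under this reading, 50% is the smallest over-commit at which a 3GB job can be placed on every core of a 2GB/core node, which is consistent with occupancy rising once that level was reached.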


Data Transfers


RAL achieved the highest average input and output data rates of any Tier-1.


OPN Resilience


(GridPP22) Current Issues: R89

In the end, hand-over to STFC was delayed from Dec to Apr 09. Hardware was delayed but we were (almost) rescued by the LHC schedule change. Minor (?) issues remain with R89 (aircon trips; water-proof membrane?).


Tier-1 Hardware

• The FY2008 hardware procurement had to await the acceptance of R89.

• The CPU is tested, accepted, and being deployed (14,000 HEPSPEC06 to add to current 19,000)

• The disk procurement (2 PB to add to existing 1.9PB) was split into two halves (different disks and controllers to mitigate against acceptance problems). This has proved sensible, as one batch has demonstrated ejection issues.

• One half of the disk is being deployed; progress is being made on the other half and best guess is deployment by end of November.

• A second SL8500 tape robot is available.

• The FY09 hardware procurement is underway.
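The totals implied by the figures above can be checked with a short calculation. It assumes 1 PB = 1000 TB and that the two halves of the disk procurement are equal in size (1 PB each); neither is stated explicitly on the slide.

```python
# Figures from the slide.
CPU_CURRENT_HS06 = 19_000   # current CPU capacity, HEPSPEC06
CPU_NEW_HS06 = 14_000       # FY2008 CPU being deployed, HEPSPEC06
DISK_CURRENT_TB = 1_900     # existing disk: 1.9 PB (assuming 1 PB = 1000 TB)
DISK_NEW_TB = 2_000         # FY2008 disk procurement: 2 PB

cpu_total = CPU_CURRENT_HS06 + CPU_NEW_HS06

# Assumption: the procurement was split into two equal halves.
disk_half = DISK_NEW_TB // 2
disk_with_first_half = DISK_CURRENT_TB + disk_half
disk_total = DISK_CURRENT_TB + DISK_NEW_TB

print(f"CPU after deployment:       {cpu_total:,} HEPSPEC06")
print(f"Disk with first half only:  {disk_with_first_half / 1000:.1f} PB")
print(f"Disk with both halves:      {disk_total / 1000:.1f} PB")
```

On these assumptions the CPU capacity grows by roughly three quarters, and about a quarter of the eventual disk capacity depends on the batch that is still being worked on.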


Disaster Management

• A four-stage disaster management process was established at the Tier-1 earlier this year as part of our focus on resilience and disaster management.

• Designed to be used regularly so that the process is familiar. This means a low threshold to trigger Stage-1 “disasters”.

• At Stage-3, the process formally involves stake-holders outside the Tier-1, including GridPP management. This has now happened several times, including:

– R89 aircon trip

– R89 water leak

– Disk procurement problem

– Swine flu planning.

• The process is still being honed, but I believe it is very useful.


EGI/NGI

Diagram: EGI (coordinating body in Amsterdam) and the NGIs (national initiatives in member countries), with the UK-NGI shown alongside GridPP and NGS.

Involves STFC, EPSRC and JISC (at least) in the UK.

EGI is vital to GridPP but it is not GridPP’s core business to run an e-science infrastructure for the whole of the UK: seek a middle ground.


EU Landscape

Diagram labels: EGI; EMI (gLite, ARC, Unicore); SSCs; Heavy Users (Roscoe).

UK involvement with Ganga?

UK involvement via the UK NGI with global tasks such as GOCDB, security, dissemination, training....

UK involvement with APEL, GridSite? …

UK involvement: FTS/LFC support post at RAL?


User Support

• Help pages.

• GridPP23 talks.

• User survey at RAL


Actions

• OPN – Detailed document provided. Cost is covered by existing GridPP hardware funds. Propose to proceed immediately to provision.

• Other Experiments – Usage shown on Slide 6 (UK Site Contributions: Non-LHC VOs). Allocation Policy is on the UserBoard web-pages: http://www.gridpp.ac.uk/eb/allocpolicy.html

• EGI/NGI/NGS – Paper provided. GridPP/UK has established potential links with all the structural units and is engaged in the developments.

• CASTOR – Paper provided. Version 2.1.7 used during STEP09 worked well beyond the levels needed. Version 2.1.8 is becoming an issue.


Current Issues

Operational:

• Timing of CASTOR 2.1.8 upgrade.

• Shake-down issues with R89.

• Problem with 50% of current disk purchase.

High Level:

• Hardware planning – lack of clarity on approved global resources.

• Hardware pledges – financial constraints and the 2010 pledges.

• GridPP4 – lack of information on scope, process or timing against a backdrop of severe financial problems within STFC.


Key issue in the next six months

To receive a sustained flow of data from CERN and to meet all the experiment expectations associated with custodial storage; data reprocessing; data distribution; and analysis.

Requires:

• A resilient OPN network

• Stable operation of CASTOR storage

• Tier-1 hardware and services

• Tier-1 to Tier-2 networking

• Tier-2 hardware and services

• Help, support, deployment and operations.

That is, the UK Particle Physics Grid.


• The milestones necessary to meet these requirements have been met (with the possible exception of the first) and the entire system validated with STEP09.

• We believe the UK is ready.

• We know that problems will arise and have focused on resilience to reduce the incidence of these, and on disaster management to handle those that do occur.


The End


Schedule

It is foreseen that the LHC will be ready for beam by mid-November.

Before that:

• All sectors powered separately to operating energy ++

• Dry runs of many accelerator systems (from Spring)

– Injection, extraction, RF, collimators

– Controls

• Full machine checkout before taking beam

• Beam tests

– TI8 (June)

– TI2 (July)

– TI2 and TI8 interleaved (September)

– Injection tests (late October)


1/Apr/09