GridPP project management Sarah Pearce 15 September 2009 GridPP Oversight Committee


Project Map GridPP3 Q2 09

Project map - statistics

Metrics and milestones by quarter

                                  Q208  Q308  Q408  Q109  Q209
Metric OK                           99   142   155   172   184
Metric close to target              24    47    39    32    22
Metric not OK                       41    32    32    21    27
Not able to be measured             27    22    11    10     3
Milestone achieved                  11    22    32    42    57
Milestone overdue                    2     7    13    17     4
Milestone not due / metric n/a     101    80    69    60    58
Suspended                            0     6     6     9    12
Awaiting input                      34     5    12    10     3
Total                              339   363   369   373   370
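As a cross-check on the quarterly counts above, the following minimal Python sketch sums each column and compares it with the reported Total row. The names `COUNTS`, `REPORTED_TOTAL`, `column_sums` and `percentage_share` are illustrative and do not come from the ProjectMap itself.

```python
# Quarterly ProjectMap counts, transcribed from the statistics slide.
QUARTERS = ["Q208", "Q308", "Q408", "Q109", "Q209"]

COUNTS = {
    "Metric OK": [99, 142, 155, 172, 184],
    "Metric close to target": [24, 47, 39, 32, 22],
    "Metric not OK": [41, 32, 32, 21, 27],
    "Not able to be measured": [27, 22, 11, 10, 3],
    "Milestone achieved": [11, 22, 32, 42, 57],
    "Milestone overdue": [2, 7, 13, 17, 4],
    "Milestone not due / metric n/a": [101, 80, 69, 60, 58],
    "Suspended": [0, 6, 6, 9, 12],
    "Awaiting input": [34, 5, 12, 10, 3],
}

# The "Total" row as printed on the slide.
REPORTED_TOTAL = [339, 363, 369, 373, 370]


def column_sums(counts):
    """Sum every category for each quarter (one value per column)."""
    return [sum(column) for column in zip(*counts.values())]


def percentage_share(category, counts, totals):
    """Share of one category in each quarter's total, in percent."""
    return [round(100 * value / total, 1)
            for value, total in zip(counts[category], totals)]


if __name__ == "__main__":
    sums = column_sums(COUNTS)
    for quarter, computed, reported in zip(QUARTERS, sums, REPORTED_TOTAL):
        status = "matches" if computed == reported else "differs from"
        print(f"{quarter}: computed {computed} {status} reported {reported}")
    print("Metric OK share (%):",
          percentage_share("Metric OK", COUNTS, REPORTED_TOTAL))
```

Expressing a row as a share of the total makes the trend across quarters easier to compare, because the total number of tracked items changes from quarter to quarter.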

Page 4: GridPP project management Sarah Pearce 15 September 2009 GridPP Oversight Committee

4GridPP Oversight Committee15/9/09

Red metrics

LHCb
• 1.2.2 - MC production (generation) efficiency
• 1.2.3 - T1 MC production (reconstruction, stripping) efficiency
• 1.2.4 - T1 MC/Event user analysis – UK efficiency
• 1.2.11 - LHCb SAM tests uptime T1
• 1.2.23 - Keep LHCb GANGA user training material updated

Operations
• 2.1.3 - Proportion of available job slots used
• 2.1.6 - Job success rates
• 2.1.10 - GridPP deployment web-pages up-to-date (review underway)

Tier-1
• 3.1.8 - Availability of CE
• 3.2.11 - Farm Occupancy
• 3.2.13 - Quarterly report not available
• 3.4.4 - % met of UB Allocation for Disk
• 3.4.8 - CASTOR SAM tests: LHC VOs

Tier-2s
• SAM availability and reliability tests: LondonGrid (4.1.3, 4.1.4)
• Metric 5 - SLL ATLAS test performance: LondonGrid and SouthGrid
• 4.2.6 - ScotGrid average SLL SE test performance
• Metrics 7 & 8 - CPU utilisation (wall clock time & CPU time): LondonGrid, SouthGrid
• Metric 9 - % of disk used: ScotGrid, SouthGrid
• 4.4.11 - Number of management meetings: NorthGrid
• 4.1.14 - Middleware upgrading: LondonGrid

Project execution
• 5.2.9 - CB meetings

Overdue milestones

Front end systems
• 3.1.22 - LHC Monitoring infrastructure operational at RAL. Waiting on work by Dante.

Resource delivery
• 3.2.16 - Disaster and Business Continuity Plan Available
• 3.2.18 - Disaster Plan fully implemented

The new disaster management system is operational and working well, but some contingency plans remain to be completed.

Storage systems
• General ADS Service Ends. This has not been a priority, but the closure process has started.

Key milestones / deliverables

• Main requirement will be to deliver for the experiments once LHC data starts – measured by a combination of metrics along the top of the ProjectMap.

Milestone no.  Description                                                                 Owner            Deadline
3.3.3          Tier-1 able to meet 2009 WLCG MoU resource commitment                       Andrew Sansum    31/08/2009
6.2.3          EGI Transition Planning for GridPP                                          Robin Middleton  01/10/2009
5.1.9          Allocations calculated for round 2 of Tier-2 hardware grants                Steve Lloyd      31/10/2009
6.2.5          Agreement with NGS/NGI on partition of services between GridPP and NGS/NGI  Robin Middleton  01/11/2009
5.1.8          Post-GridPP planning initiated                                              Dave Britton     01/01/2010
3.3.10         2010 Disk Tender Started                                                    Martin Bly       02/01/2010
3.3.21         2010 CPU Tender Started                                                     Martin Bly       02/01/2010
5.1.10         Grants for 2009 Tier-2 hardware issued                                      Sarah Pearce     31/03/2010
5.1.11         Grants for 2010 Tier-2 hardware issued                                      Sarah Pearce     30/04/2010
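To see how the deadlines in the milestone table sit relative to the date of this meeting, a short Python sketch can sort them and compute the number of days from 15 September 2009. The reference date is taken from the slide footer; the names `MILESTONES`, `days_until` and `by_deadline` are illustrative. A negative value only means the deadline falls before the meeting date. The table does not record whether a milestone was achieved, so the sketch makes no claim about status.

```python
from datetime import date, datetime

# Date of the Oversight Committee meeting, from the slide footer.
MEETING_DATE = date(2009, 9, 15)

# (milestone number, owner, deadline as DD/MM/YYYY), transcribed from the table.
MILESTONES = [
    ("3.3.3", "Andrew Sansum", "31/08/2009"),
    ("6.2.3", "Robin Middleton", "01/10/2009"),
    ("5.1.9", "Steve Lloyd", "31/10/2009"),
    ("6.2.5", "Robin Middleton", "01/11/2009"),
    ("5.1.8", "Dave Britton", "01/01/2010"),
    ("3.3.10", "Martin Bly", "02/01/2010"),
    ("3.3.21", "Martin Bly", "02/01/2010"),
    ("5.1.10", "Sarah Pearce", "31/03/2010"),
    ("5.1.11", "Sarah Pearce", "30/04/2010"),
]


def parse_deadline(text):
    """Parse a DD/MM/YYYY string into a date object."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def days_until(deadline, reference=MEETING_DATE):
    """Days from the reference date to the deadline (negative if earlier)."""
    return (parse_deadline(deadline) - reference).days


def by_deadline(milestones):
    """Return the milestones ordered by deadline, earliest first."""
    return sorted(milestones, key=lambda m: parse_deadline(m[2]))


if __name__ == "__main__":
    for number, owner, deadline in by_deadline(MILESTONES):
        print(f"{number:<7} {owner:<16} {deadline}  {days_until(deadline):+d} days")
```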

Risk register

Top 5 risks (I)

R1 - Recruitment / retention difficulties
Likelihood (Li): 3   Impact (Im): 3   Risk: 9   Owner: SP
Current process for managing risk:
1. Universities encouraged to advertise early (though some won't allow this until grants are received)
2. Tier-1 staff are on long-term contracts so retention is better.
3. Tier-2 coordinators can to some extent cover for missing Tier-2 support posts
4. Ensure staff remain motivated
5. Building likelihood of turnover and unfilled posts into staffing model, especially at the Tier-1
Future mitigation option:
1. Use contract staff for well defined tasks and short periods
2. Try to establish future funding early, to aid with retention
3. Escalate to PMB and/or Director e-Science at RAL
Costs: Possible extra costs for employing contract staff - but could be offset by underspend in other posts.

R5 - Service insufficiently resilient wrt storage
Likelihood (Li): 2   Impact (Im): 4   Risk: 8   Owner: JC
Current process for managing risk:
1. Tier-1 storage review analysed issues with CASTOR
2. Extra staff member has been appointed in D/B CASTOR area.
3. Monitoring of available disk space.
4. Procurements take account of experiment requests.
Future mitigation option:
1. Attention to be concentrated on the Oracle database system to ensure that it operates at an appropriate load and to engage Oracle better.
2. Have T1 excess disk capacity available at short notice (i.e. procure more or make data center arrangements).
3. Experiment experts embedded with CASTOR team
Costs: 1 extra staff member employed for FY09 and FY10. Estimated cost of 140k met by reprofiling RAL staff costs.

R10 - Hardware resources inadequate/insufficient
Likelihood (Li): 2   Impact (Im): 4   Risk: 8   Owner: GP
Current process for managing risk: Quarterly review of resources and priorities at UB meetings. Weekly review of storage resources at Castor meetings. Ability to redefine intra-experiment CPU fairshares at short notice.
Future mitigation option: Purchase more hardware and/or improve profiling and procurement. Reduce non-LHC experiment resources. Agree programme priorities through PMB and STFC.
Costs: If necessary, we would aim to reprofile Tier-1 hardware funds to meet requirements.

Top 5 risks (II)

R12 - Machine room problems compromise Tier-1
Likelihood (Li): 4   Impact (Im): 3   Risk: 12   Owner: AS
Current process for managing risk:
1. Two separate issues that appeared in the week of Aug 10th (an aircon failure and a small water leak due to condensation) are currently being managed by the Tier-1 Disaster Management protocol.
2. Ensure involvement at all levels of project. e-Science department is project sponsor.
Future mitigation option:
1. Re-certification of the machine room environment by an independent consultant.
2. Return Tier-1 to the ATLAS building.
3. If a problem occurs after migration has been completed (for example, unreliable air conditioning), seek remedy from the builder. Rent air-conditioning units if necessary. Run critical services from UPS generator or move small volumes back to old building.

R14 - Network/OPN breakage
Likelihood (Li): 2   Impact (Im): 4   Risk: 8   Owner: PC
Current process for managing risk:
1. Existing practice for network outages, i.e. "talk with your neighbour". For the OPN, the link layer neighbour is JANET(UK) and the routing layer neighbour is CERN.
2. LHCOPN Operational Handbook describes and defines responsibilities: https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel
3. Plan for N Gbit/s backup provision.
Future mitigation option:
1. Fully funded second N Gbit/s backup provision across a diverse route provides fallback routing on link failure
2. Disaster recovery plans exist for the RAL network components that are used on the path from the OPN to the Tier 1.
3. Service Continuity Plan exists to manage a "wider crisis" (separate document)
Costs: GridPP proposes to spend £52k of our existing hardware budget to install a backup link, supported by a recurrent cost of between £40k and £60k per annum, depending on negotiations about the end-point costs.
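In all five entries of the top-risk tables, the Risk value is consistent with being the product of the Li and Im columns (read here as likelihood and impact, which is an assumption about the abbreviations). The Python sketch below records the five risks, recomputes the score under that assumption, and ranks them. The class and function names are illustrative and not part of the GridPP risk register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    risk_id: str
    name: str
    likelihood: int      # "Li" column (assumed to mean likelihood)
    impact: int          # "Im" column (assumed to mean impact)
    reported_score: int  # "Risk" column as printed on the slides
    owner: str

    @property
    def score(self) -> int:
        """Score under the assumed rule: likelihood multiplied by impact."""
        return self.likelihood * self.impact


# Values transcribed from the two "Top 5 risks" slides.
RISKS = [
    Risk("R1", "Recruitment / retention difficulties", 3, 3, 9, "SP"),
    Risk("R5", "Service insufficiently resilient wrt storage", 2, 4, 8, "JC"),
    Risk("R10", "Hardware resources inadequate/insufficient", 2, 4, 8, "GP"),
    Risk("R12", "Machine room problems compromise Tier-1", 4, 3, 12, "AS"),
    Risk("R14", "Network/OPN breakage", 2, 4, 8, "PC"),
]


def ranked(risks):
    """Order risks from highest to lowest computed score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    for risk in ranked(RISKS):
        check = "ok" if risk.score == risk.reported_score else "mismatch"
        print(f"{risk.risk_id:<4} {risk.score:>2} ({check})  {risk.name}")
```

Python's sort is stable, so risks with equal scores keep the order in which they appear on the slides.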

Finances

• £335k of Tier-1 hardware rolled over from FY08 to FY09, as a result of planning for LHC reschedule and R89 delay.

• £1m of Tier-1 hardware delayed until FY10, at request of STFC.

• At the request of STFC, most Tier-2 hardware grants should be issued in early FY10; a small number of sites require them in FY09.

Staffing

• Some areas have not finished recruiting, so funded effort is below that expected

• But in all cases this is more than compensated by unfunded effort