GridPP3
David Britton
6/September/2006
6/September/2006 GridPP3 D. Britton
Overview
The GridPP3 proposal consists of a 7-month extension to GridPP2, followed by a three-year GridPP3 project starting in April 2008.
GridPP2+ (7-month extension from September 2007 to March 2008)
- Early approval sought in order to ensure staff retention.
- Provides continuity of management and support over the LHC start-up.
- Aligns the project with (a) financial year; (b) EGEE and other EU projects.
GridPP3 (3-year project from April 2008 to March 2011)
- “From production to exploitation.”
- Delivers large-scale computing resources in a supported environment.
- Underpins the success of the UK contribution to the LHC.
Global Context
[Figure: timeline from 2001 to 2007 showing EDG, EGEE-I and EGEE-II leading up to LHC data taking, in parallel with GridPP1, GridPP2 and GridPP3, and a possible future EGI. GridPP, EDG/EGEE, LCG/wLCG and (many) other projects are linked by evolving standards, developing requirements, changing costs and budgets, and experience.]
WLCG MoU
• 17 March 2006: PPARC signed the Memorandum of Understanding with CERN
• Commitment to UK Tier-1 at RAL and the four UK Tier-2s to provide services and resources
• Current MoU signatories: China, France, Germany, Italy, India, Japan, Netherlands, Pakistan, Portugal, Romania, Taiwan, UK, USA
• Pending signatures: Australia, Belgium, Canada, Czech Republic, Nordic, Poland, Russia, Spain, Switzerland, Ukraine
Grid Overview
Aim: by 2008 (full year’s data taking)
- CPU ~100MSi2k (100,000 CPUs)
- Storage ~80PB
- Involving >100 institutes worldwide
- Build on complex middleware in Europe (gLite) and in the USA (VDT)
1. Prototype went live in September 2003 in 12 countries
2. Extensively tested by the LHC experiments in September 2004
3. 197 sites, 13,797 CPUs, 5PB storage in September 2005
4. 177 active sites, 26,527 CPUs, 10PB storage in September 2006
Tier-0 to Tier-1
• worldwide data transfers > 950MB/s for 1 week
• peak transfer rate from CERN of >1.6GB/s
• Ongoing experiment transfers as part of current service challenges
Tier-1 to Tier-2
• UK data transfers >1000Mb/s for 3 days
• Peak transfer rate from RAL of >1.5Gb/s
• Require high data rate transfers (300-500Mb/s) to/from RAL as a routine activity
It’s in use: Active Users by LHC experiment
ALICE (8)
CMS (150)
ATLAS (70)
LHCb (40)
Tier Centres
[Figure: event data model. Each event (Event 1, Event 2, Event 3) is stored in RAW, ESD, AOD and TAG data files, together with an “Interesting Events” list. The data are distributed across Tier-0 (International), Tier-1 (National), Tier-2 (Regional) and Tier-3 (Local) centres.]
Experiment  Tier-0                               Tier-1                                                      Tier-2
ALICE       First-pass scheduled reconstruction  Reconstruction; On-demand analysis                          Central simulation; On-demand analysis
ATLAS       First-pass scheduled reconstruction  Reconstruction; Scheduled analysis/skimming; Calibration    Simulation; On-demand analysis; Calibration
CMS         First-pass scheduled reconstruction  Reconstruction; Scheduled analysis/skimming                 Simulation; On-demand analysis; Calibration
LHCb        First-pass scheduled reconstruction  Reconstruction; On-demand analysis; Scheduled skimming      Simulation
LHC Hardware Requirements
Resource              Row                   ALICE   ATLAS   CMS     LHCb
Tier-1 CPU [MSI2k]    Required              12.3    24.0    15.2    4.4
                      Non-UK Pledged        54%     89%     73%     85%
                      GridPP3               0.16    3.00    1.56    0.74
                      GridPP3 / Required    1%      13%     10%     17%
Tier-1 Disk [PB]      Required              7.4     14.4    7.0     2.4
                      Non-UK Pledged        36%     81%     75%     77%
                      GridPP3               0.11    1.78    0.84    0.41
                      GridPP3 / Required    1%      12%     12%     17%
Tier-1 Tape [PB]      Required              6.9     9.0     16.7    2.1
                      Non-UK Pledged        45%     90%     53%     74%
                      GridPP3               0.10    1.12    1.44    0.35
                      GridPP3 / Required    1%      12%     9%      17%
Tier-2 CPU [MSI2k]    Required              14.4    19.9    19.3    7.7
                      Non-UK Pledged        41%     83%     90%     38%
                      GridPP3               0.18    2.66    1.80    2.17
                      GridPP3 / Required    1%      13%     9%      28%
Tier-2 Disk [PB]      Required              3.5     8.7     4.9     n/a
                      Non-UK Pledged        39%     63%     92%     n/a
                      GridPP3               0.05    1.14    0.40    n/a
                      GridPP3 / Required    1%      13%     8%      n/a
ALICE: Based on UK M&O author fraction (1.2%).
ATLAS: Based on UK fraction of Tier-1 Authors.
CMS: Based on a threshold size for a minimum viable Tier-1.
LHCb: Based on Authorship fraction (16.5%) and number of Tier-1s.
Overall resource level reviewed by LHCC.
Balance of CPU, Storage, and Network driven by computing models.
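As a rough cross-check of the requirements table, the GridPP3 share of each experiment's Tier-1 CPU requirement can be recomputed from the "Required" and "GridPP3" rows. The sketch below is illustrative only: the dictionary layout and the function name are assumptions, and only the numbers are taken from the table.

```python
import math

# Tier-1 CPU figures [MSI2k] copied from the requirements table.
REQUIRED = {"ALICE": 12.3, "ATLAS": 24.0, "CMS": 15.2, "LHCb": 4.4}
GRIDPP3 = {"ALICE": 0.16, "ATLAS": 3.00, "CMS": 1.56, "LHCb": 0.74}


def uk_share_percent(required, provided):
    """Return the provided resource as a whole-number percentage of the requirement.

    Rounds half up (12.5 -> 13), which is an assumption about how the
    slide's percentages were rounded.
    """
    return {
        expt: math.floor(100.0 * provided[expt] / required[expt] + 0.5)
        for expt in required
    }


shares = uk_share_percent(REQUIRED, GRIDPP3)
for expt, pct in shares.items():
    print(f"{expt}: {pct}%")
```

Under the round-half-up assumption, the computed shares agree with the percentage row quoted for Tier-1 CPU.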
Non-LHC Hardware Requirements
BaBar: Included explicitly, based on a well-understood resource requirement per fb⁻¹ and the expected luminosity profile up to October 2008. Level is ~15% of Tier-1 CPU and Tape, and 9% of Disk, in 2008.
UKQCD: Request received after the planning stage was completed, so not included in the model. (Some uncertainty remains over whether UKQCD will move to an LCG-based Grid and how manpower would be funded.) Level is 3%-4% of Tier-2 resources and ~7% of Tier-1 tape in 2008.
Others: The requirements of other, smaller user groups, and some provision for future larger groups (LC, Neutrino) whose requirements are currently largely unknown, have been addressed with a 5% envelope allocation of Tier-2 Disk and CPU, and Tier-1 Tape.
Budget Overview
Budget shares: Tier-1 50%, Tier-2 25%, Support 13%, Operations 6%, Management 3%, Travel+Other 2%, Outreach 1%.
Cost Table                  TOTAL [£m]
Tier-1 Staff                 4.99
Tier-1 Hardware             11.72
Tier-2 Staff                 3.29
Tier-2 Hardware              5.12
Grid Support Staff           4.50
Grid Operations Staff        1.89
Management                   1.17
Outreach                     0.37
Travel and Other             0.84
Project Sub-Total           33.90
Working Allowance (4%)       1.25
Project Cost                35.15
Contingency (12%)            4.15
Tier-1 Running Costs         2.50
Full Approval Cost          41.80
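The way the cost table rolls up can be sketched in a few lines: the work-package lines sum to the project sub-total, the working allowance gives the project cost, and contingency plus Tier-1 running costs give the full approval cost. The variable names below are assumptions; the figures are the rounded £m values from the table, so the recomputed totals can differ from the quoted ones by about £0.01m.

```python
# Work-package costs in £m, copied from the cost table.
COSTS = {
    "Tier-1 Staff": 4.99,
    "Tier-1 Hardware": 11.72,
    "Tier-2 Staff": 3.29,
    "Tier-2 Hardware": 5.12,
    "Grid Support Staff": 4.50,
    "Grid Operations Staff": 1.89,
    "Management": 1.17,
    "Outreach": 0.37,
    "Travel and Other": 0.84,
}
WORKING_ALLOWANCE = 1.25
CONTINGENCY = 4.15
TIER1_RUNNING_COSTS = 2.50

# Roll the lines up in the same order as the table.
sub_total = sum(COSTS.values())
project_cost = sub_total + WORKING_ALLOWANCE
full_approval_cost = project_cost + CONTINGENCY + TIER1_RUNNING_COSTS

print(f"Project Sub-Total:  {sub_total:.2f}")
print(f"Project Cost:       {project_cost:.2f}")
print(f"Full Approval Cost: {full_approval_cost:.2f}")
```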
Tier-1 Centre
Defined by the experiment hardware requirements, the experiment computing models, a hardware costing model, and by the service levels defined in the international MOU signed by PPARC.
Estimated Tier-1 peak data flows in 2008 [MB/s]
         CPU to disk  Disk to CPU  Disk to tape  Tape to disk  T0 to T1  T1 to T1 (T1 to T0)  T2 to T1  T1 to T2
ATLAS    940          2361         264           34            610       165                  25        105
CMS      423          1590         242           45            240       240                  58        360
LHCb     212          278          63            54            184       184                  3         0
Total    1130         4229         505           133           752       589                  86        465
[Figures: Tier-1 CPU capacity, disk capacity and tape capacity by experiment (ALICE, ATLAS, BaBar, CMS, LHCb, Others) for 2008 to 2012.]
Tier-1 Centre: Service Level
Columns (1)-(3): maximum delay in responding to operational problems, for (1) a service interruption, (2) degradation of the capacity of the service by more than 50%, (3) degradation of the capacity of the service by more than 20%.
Columns (4)-(5): average availability measured on an annual basis, (4) during accelerator operation, (5) at all other times.
Service                                                                              (1)        (2)        (3)        (4)    (5)
Acceptance of data from the Tier-0 Centre                                            12 hours   12 hours   24 hours   99%    n/a
Networking service to the Tier-0 Centre during accelerator operation                 12 hours   24 hours   48 hours   98%    n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres     24 hours   48 hours   48 hours   98%    98%
All other services - prime service hours                                             2 hours    2 hours    4 hours    98%    98%
All other services - other times                                                     24 hours   48 hours   48 hours   97%    97%
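To get a feel for what these availability targets mean in practice, an annual availability percentage can be converted into an allowed downtime. The sketch below assumes the target is applied over a full calendar year of 8760 hours. That is a simplification, since the MoU distinguishes accelerator operation from other times. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(availability_percent, hours=HOURS_PER_YEAR):
    """Hours of unavailability permitted by an average availability target."""
    return hours * (1.0 - availability_percent / 100.0)


# Availability targets that appear in the service-level table.
for target in (99, 98, 97):
    print(f"{target}% availability allows {allowed_downtime_hours(target):.1f} hours of downtime per year")
```

On this assumption a 99% target leaves under four days of downtime per year, which is why the Tier-1 staffing includes an Incident Response Unit with out-of-hours call-out.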
Tier-1 Centre: Staff
GridPP3 Work Area         PPARC funding   CCLRC funding
CPU                       2.0             0.0
Disk                      3.0             0.0
Tape and CASTOR           2.0             1.3
Core Services             1.0             0.5
Operations                3.0             1.0
Incident Response Unit    3.0             0.0
Networking                0.0             0.5
Deployment                1.5             0.0
Experiments               1.5             0.0
Tier-1 Management         1.0             0.3
Totals                    18.0            3.6
Core services refer to user-file systems, monitoring, software deployment and conditions database.
Operations refers to machine-room environment, hardware diagnostics/repair, automation, fabric management, tape-movement etc.
Incident Response Unit addresses MOU service requirement including out-of-hours call out.
Tier-2 Centres
GridPP has successfully developed four distributed Tier-2 Centres which have:
- Engaged the institutes;
- Levered large amounts of resources;
- Developed local expertise;
- Stimulated cross-disciplinary relationships;
- Helped promote the Grid, GridPP, Particle Physics, and the local groups within the universities.
Successes: Development of regional management structure; MOU signed by each institute with GridPP; deployment of complex middleware; accounting; security; data-transfers; all fully operational and contributing to LCG.
Tier-2 Centres
To match the LHC computing models, around 50% of the UK computing resources will be located at the Tier-2s. Service levels are not as demanding as at the Tier-1.
Service                       Maximum delay in responding to operational problems   Average availability measured on an annual basis
                              Prime time     Other periods
End-user analysis facility    2 hours        72 hours                                 95%
Other services                12 hours       72 hours                                 95%
Distributed nature of the UK Tier-2 has technical advantages (“divide and conquer”) and technical drawbacks (“inefficiencies”). Importance of political/social aspects should not be underestimated.
[Figures: Tier-2 total CPU capacity and Tier-2 total disk capacity by experiment (ALICE, ATLAS, BaBar, CMS, LHCb, Others, UKQCD) for 2008 to 2012.]
Tier-2 Market Model
1) Assume all Institutes involved are interested in building on their current contribution so that…
2) Effectively a “market” exists to provide Tier-2 resources to HEP (because many Institutes have dual-funding opportunities and/or internal reasons to be involved).
3) GridPP offers a market price for Tier-2 resources which institutes may or may not choose to accept.
4) The market price is adjusted to optimise resources obtained.
5) The market price is bounded by what it would cost to provision the resources at the Tier-1.
Inefficiencies associated with the distributed nature of the Tier-2s may be balanced by an increase in competition/leverage.
Tier-2 Hardware Allocations
Constrained by the requirement for Institutional JeS forms, GridPP made an initial mapping (or allocation, i.e. not quite the “market” approach intended) of Tier-2 hardware.
Allocations based on past-delivery; current size; and size of the local community of physicists.
Fraction of Experiment allocated to each Tier-2
Tier-2       ATLAS   CMS    LHCb   Other
London       0.25    0.75   0.10   0.30
NorthGrid    0.50    0.00   0.20   0.40
ScotGrid     0.15    0.00   0.30   0.10
SouthGrid    0.10    0.25   0.40   0.20
Relative fraction of Experiment allocated to each Institute within the Tier-2
Tier-2       Institute     ATLAS   CMS    LHCb   Other
London       Brunel        0.00    0.10   0.00   0.15
             Imperial      0.00    0.90   1.00   0.00
             QMUL          0.70    0.00   0.00   0.60
             RHUL          0.20    0.00   0.00   0.15
             UCL           0.10    0.00   0.00   0.10
NorthGrid    Lancaster     0.50    0.00   0.00   0.50
             Liverpool     0.20    0.00   1.00   0.15
             Manchester    0.20    0.00   0.00   0.20
             Sheffield     0.10    0.00   0.00   0.15
ScotGrid     Durham        0.10    0.00   0.10   0.25
             Edinburgh     0.00    0.00   0.40   0.25
             Glasgow       0.90    0.00   0.50   0.50
SouthGrid    Birmingham    0.40    0.00   0.00   0.30
             Bristol       0.00    0.50   0.25   0.20
             Cambridge     0.25    0.00   0.25   0.15
             Oxford        0.25    0.00   0.25   0.25
             RAL PPD       0.10    0.50   0.25   0.10
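The two allocation tables combine multiplicatively: an institute's share of an experiment's UK Tier-2 hardware is the Tier-2 fraction times the institute's relative fraction within that Tier-2. The sketch below shows this for ScotGrid only. The data structures and function name are assumptions for illustration, and the numbers are copied from the tables.

```python
# Fraction of each experiment allocated to ScotGrid (first table).
TIER2_FRACTION = {
    "ScotGrid": {"ATLAS": 0.15, "CMS": 0.00, "LHCb": 0.30, "Other": 0.10},
}

# Relative fraction of each experiment per institute within ScotGrid (second table).
INSTITUTE_FRACTION = {
    "ScotGrid": {
        "Durham": {"ATLAS": 0.10, "CMS": 0.00, "LHCb": 0.10, "Other": 0.25},
        "Edinburgh": {"ATLAS": 0.00, "CMS": 0.00, "LHCb": 0.40, "Other": 0.25},
        "Glasgow": {"ATLAS": 0.90, "CMS": 0.00, "LHCb": 0.50, "Other": 0.50},
    },
}


def institute_share(tier2, institute, experiment):
    """Share of an experiment's UK Tier-2 hardware allocated to one institute."""
    return (
        TIER2_FRACTION[tier2][experiment]
        * INSTITUTE_FRACTION[tier2][institute][experiment]
    )


glasgow_atlas = institute_share("ScotGrid", "Glasgow", "ATLAS")
print(f"Glasgow share of UK Tier-2 ATLAS hardware: {glasgow_atlas:.3f}")

# The relative fractions within a Tier-2 should sum to one for each experiment it hosts.
scotgrid_atlas_total = sum(
    fractions["ATLAS"] for fractions in INSTITUTE_FRACTION["ScotGrid"].values()
)
```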
Tier-2 Staff Allocations
GridPP currently funds 9 FTE at 17 institutes. In GridPP3, this is proposed to increase to 14.75 FTE (cf. Tier-1 has 18 FTE funded by GridPP3 for a comparable amount of hardware). Again, in this market approach this is the “effort (currently) offered” and not an estimate of the “full effort needed”.
London      FTE     NorthGrid    FTE     ScotGrid    FTE     SouthGrid    FTE
Brunel      0.50    Lancaster    1.50    Durham      0.25    Birmingham   1.00
Imperial    1.50    Liverpool    1.00    Edinburgh   0.50    Bristol      1.00
QMUL        1.00    Manchester   1.50    Glasgow     1.00    Cambridge    0.50
RHUL        0.50    Sheffield    1.00                        Oxford       0.50
UCL         1.00                                             RAL PPD      0.50
Total       4.50                 5.00                1.75                 3.50
Tier-2 Hardware Costs (Agreed by CB)
CPU (KSI2K)               2007     2008     2009     2010     2011     2012
Requirement                        7560     10215    14522    18203    21708
Amount paid for           1559     2106     2994     3753     4476
Unit Cost [£k/KSI2K]      0.392    0.312    0.247    0.175    0.124    0.087
Cost [£k]                 612      656      740      656      553      0
Total (inc Disk) [£k]     1,163    1,295    1,383    1,282    1,120    0
• Take the requirement in the following year (7560) divided by the lifetime in years (4.85 CPU, 3.9 Disk) = 1559
• Multiply by the unit cost in that year (£0.392k/KSI2K) = £612k
• Similarly for disk.
• Up to institutes how they spend it (new kit, replacement kit, central services … )
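The costing recipe in the bullets can be written out directly: the amount paid for in a year is the following year's requirement divided by the hardware lifetime, and the cost is that amount times the year's unit cost. The sketch below applies it to the CPU rows. The names are assumptions, and the year alignment (each requirement belonging to the year after the payment) follows the worked example in the bullets. Because the tabulated unit costs are rounded, the recomputed costs agree with the table only to within about 1%.

```python
CPU_LIFETIME_YEARS = 4.85

# Requirement in the FOLLOWING year [KSI2K], keyed by the year in which it is paid for.
REQUIREMENT_NEXT_YEAR = {2007: 7560, 2008: 10215, 2009: 14522, 2010: 18203, 2011: 21708}

# Unit cost [£k per KSI2K] in the year of payment.
UNIT_COST = {2007: 0.392, 2008: 0.312, 2009: 0.247, 2010: 0.175, 2011: 0.124}


def amount_paid_for(requirement_next_year, lifetime_years=CPU_LIFETIME_YEARS):
    """Capacity funded in one year: next year's requirement spread over the lifetime."""
    return requirement_next_year / lifetime_years


def annual_cost(year):
    """Cost in £k of the CPU capacity funded in the given year."""
    return amount_paid_for(REQUIREMENT_NEXT_YEAR[year]) * UNIT_COST[year]


amounts = {year: round(amount_paid_for(req)) for year, req in REQUIREMENT_NEXT_YEAR.items()}
costs = {year: annual_cost(year) for year in REQUIREMENT_NEXT_YEAR}

for year in sorted(costs):
    print(f"{year}: pay for {amounts[year]} KSI2K, cost about £{costs[year]:.0f}k")
```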
Tier-2 Resources
Sanity Checks:
1) Compare cost to GridPP of hardware at the Tier-1 and Tier-2 integrated over the lifetime of the project.
2) Total cost to project: Can compare (Staff + Hardware) cost of the Tier-2 facilities with the cost to the project of placing the same hardware at the Tier-1 (assuming that doubling the Tier-1 hardware requires a 35% increase in staff).
                         Tier-1   Tier-2
CPU (K£/KSI2K-year):     0.070    0.045
DISK (K£/TB-year):       0.144    0.109
TAPE (K£/TB-year):       0.052
Including staff and hardware, the cost of the Tier-2 facilities is ~80% of the cost of an enlarged Tier-1.
Grid Support
Refers to staff effort for the support of Middleware, Security and Networking areas in GridPP3. The emphasis is on a managed transition from middleware development to middleware support (operational and bug-fixing).
Three criteria applied to guide prioritisation of areas for support:
1) Areas which are “mission critical” for the UK.
2) Areas which are viewed as established “international obligations”.
3) Areas which provide significant leverage to the obvious advantage of GridPP.
Background documents discuss areas in terms of:
a) Operational Support
b) Maintenance (bug-fixing)
c) Development (phased out where practical).
Grid Support Areas
Area                              Role                                    FY08   FY09   FY10
Grid Data Management              Operational Service Support             1.0    1.0    1.0
                                  Metadata                                1.0    1.0    1.0
                                  Replica Management                      1.0    1.0    1.0
Storage                           Castor Support                          1.0    1.0    1.0
                                  DPM Support                             1.0    1.0    0.5
                                  dCache Support                          1.0    0.5    0.5
Information & Monitoring          Operational Service Support             1.0    1.0    1.0
                                  R-GMA & Support                         3.0    1.5    0.6
                                  Service Discovery Support               0.5    0.3    0.2
                                  GLUE & International Collaboration      0.5    0.2    0.2
Workload, Performance & Portal    Performance                             1.0    0.5    0.5
                                  Testing                                 0.5    0.5    0.5
                                  Real-Time Monitoring                    0.5    0.5    0.5
                                  Portal Support                          0.5    0.5    0.5
Security                          Operational Security Officer            1.0    1.0    1.0
                                  GridSite Support                        1.5    1.5    1.5
                                  VOMS Support                            0.5    0.5    0.5
                                  International Security Co-ordination    0.8    0.8    0.8
Networking                        Requirements & Provisioning             0.5    0.5    0.5
                                  Performance Monitoring                  0.5    0.5    0.5
TOTALS                                                                    18.3   15.3   13.8
Grid Support Staff Evolution
                                             GridPP2          GridPP2+         GridPP3
GridPP2 area / post                          GridPP   EGEE    GridPP   EGEE    FY08   FY09   FY10   GridPP3 area
WLMS                                                                           2.5    2.0    2.0    Workload Performance and Portal
  Tier-2 Expert                              1.0              1.0
  MSN                                        1.0              1.0
  Portal Apps. Interface                     1.0              1.0
Data Management                                                                3.0    3.0    3.0    Data Management
  Tier-2 Expert                              1.0              1.0
  MSN                                        1.0              1.0
Data Storage                                                                   3.0    2.5    2.0    Storage
  Tier-2 Expert                              1.0              1.0
  MSN                                        2.0              2.0
Security                                                                       3.8    3.8    3.8    Security
  Tier-2 Expert                              1.5              1.5
  MSN                                        3.5              3.5
InfoMon                                                                        5.0    3.0    2.0    InfoMon
  Tier-2 Expert                              0.0              0.0
  MSN                                        3.5      3.5     3.5      3.5
Network                                                                        1.0    1.0    1.0    Networking
  Tier-2 Expert                              0.5      1.5     0.5      1.5
  MSN                                        2.0              2.0
HP Post
  Tier-2 Expert                              0.5              0.0              0.0    0.0    0.0
GRAND TOTAL                                  24.5             24.0             18.3   15.3   13.8
Grid Operations
Team of 8.5 FTE consisting of:
- 1 Production Manager;
- 4 Tier-2 Coordinators;
- 3 to run the UK/GridPP Grid Operations Centre (GOC);
- 0.5 FTE to coordinate technical documentation.
Responsible for the deployment, operation, and support of the UK Particle Physics environment. The Production Manager is responsible for resolving technical and coordination issues that span the Tier-1 and Tier-2s and for ensuring a stable production service with appropriate upgrades to improve functionality and quality.
The current GOC (5.5 FTE funded by EGEE) is responsible for monitoring the world-wide Grid operations, providing trouble tickets, accounting services, and administrative tools.
Operations Posts
Area               Role                                                                       FY08   FY09   FY10
Grid Deployment    Production Manager                                                         1.0    1.0    1.0
                   Tier-2 Technical Coordinators (one for each of the 4 regional centres)     4.0    4.0    4.0
                   Technical Documentation                                                    0.5    0.5    0.5
Grid Operations    Monitoring of LCG operations in the UK                                     1.0    1.0    1.0
                   Grid Accounting                                                            1.0    1.0    1.0
                   International Coordination                                                 0.5    0.5    0.5
                   Security Risk Management                                                   0.5    0.5    0.5
TOTALS                                                                                        8.5    8.5    8.5
GridPP3 Structure
[Figure: GridPP3 organisational structure, showing the Oversight Committee (OC), Collaboration Board (CB), Project Management Board (PMB), Deployment Board (DB) and User Board (UB). Diagram labels: Provision, Utilisation, Review, React; Earth, Wind, Water, Fire.]
Management Continuity
GridPP2 post               GridPP2   GridPP2+   FY08   FY09   FY10   GridPP3 post
Project Leader             0.67      0.67       0.90   0.90   0.90   Project Leader
Project Manager            0.90      0.90       1.00   1.00   1.00   Project Manager
T2 Coordinator             0.50      0.50
DB Chair                   0.30      0.30       0.40   0.40   0.40   Deployment Coordinator
UB Chair                   0.00      0.00       0.25   0.25   0.25   UB Chair
Middleware Coordinator     0.50      0.50
Application Coordinator    0.50      0.50       0.40   0.40   0.40   Technical Coordinator
CCLRC Management           0.50      0.50       0.50   0.50   0.50   CCLRC Management
Sub-Total                  3.87      3.87       3.45   3.45   3.45
TD DB SL SP
Outreach
Currently a Dissemination and an Events Officer (1.5 FTE). Instructions in the PPARC call include the statement:
“It is expected that a plan for collaboration with industry will be presented or justification if such a plan is not appropriate.”
Therefore, broaden the mandate to include industrial liaison without increasing manpower, but add 0.5 FTE to this area from the current documentation officer to handle user documentation and web-site maintenance. Overall team of 2 FTE responsible for:
- Dissemination activities (news, press-releases, liaison with partners, etc.)
- Event organisation (demos, publicity, etc.)
- Industrial liaison (to be developed.)
- Basic user documentation and website maintenance.
GridPP3 Posts
                                             GridPP2           GridPP2+          GridPP3
GridPP2 area     Posts                       GridPP   EGEE     GridPP   EGEE     FY08    FY09    FY10    GridPP3 area
Management       All Management posts        3.87              3.87              3.45    3.45    3.45    Management
Tier-1           All Tier-1 Services         13.50             16.00             18.00   18.00   18.00   Tier-1
Tier-2           Hardware Support            9.00              9.00              14.75   14.75   14.75   Tier-2
                 Tier-2 Specialist Posts     5.50     1.50     5.00     1.50
Middleware       All MSN Posts               13.00    3.50     13.00    3.50     18.30   15.30   13.80   Support
Applications     All Application Posts       18.50             1.00
Operations       Operations Manager          1.00              1.00              1.00    1.00    1.00    Operations
                 Tier-2 Coordinators         0.00     4.00     0.00     4.00     4.00    4.00    4.00
                 GOC Posts                   0.00     5.50     0.00     5.50     3.00    3.00    3.00
                 Technical Documentation                                         0.50    0.50    0.50
Documentation    Documentation Officer       1.00              1.00              0.50    0.50    0.50    Outreach
Dissemination    Dissemination + Events      1.50              1.50              1.50    1.50    1.50
TOTAL                                        81.37             65.87             65.00   62.00   60.50   TOTAL
Travel and Other Costs
Based on experience in GridPP2 we have budgeted £3.5k per FTE per annum for travel, a reduction of about 10%, to cover collaboration meetings, national and international conferences and workshops, technical meetings, management meetings, etc.
“Other Costs” of £15k per annum have been included for outreach expenses and other operational expenses (licences, laptops, test machines, web server, software etc).
Total Costs [k£]
Work Package                     FY07      FY08      FY09      FY10      Total
A  Tier-1 Staff                  693.47    1384.21   1432.66   1482.80   4993
   Tier-1 Hardware               3810.50   2621.22   3025.76   2265.54   11723
B  Tier-2 Staff                  147.84    1008.59   1047.87   1088.90   3293
   Tier-2 Hardware               1163.24   1294.63   1382.60   1281.84   5122
C  Support                       695.81    1416.04   1232.93   1155.33   4500
D  Operations                    43.34     592.72    614.84    637.85    1889
E  Management                    194.35    311.78    324.97    338.78    1170
F  Outreach                      59.97     99.72     103.58    107.61    371
G  Travel and Other Costs        134.48    242.50    232.00    226.75    836
   Total                         6943.00   8971.41   9397.21   8585.39   33897
Risks
Columns: # / Name / Likelihood (1-4) / Impact (1,2,3,5) / Risk (L x I) / Action to Mitigate Risk
1   Insufficient funding.                                          3  5  15  Present requirements. PPRP to advise on strategic priorities.
2   Hardware costing. Hardware prices don’t fall as anticipated.   2  3  6   Delay if possible or de-scope if necessary.
3   Tier-2 market fails.                                           2  3  6   Increase Tier-2 hardware price and/or Tier-2 staffing level.
4   Tier-1 fails to meet service level required.                   3  2  6   Increase Tier-1 staffing level.
5   Tier-2s fail to meet service level required.                   3  2  6   Increase Tier-2 staffing level.
6   Middleware fails.                                              2  3  6   Mitigated by experiment specific solutions. Work with partners to address shortcomings. Re-target support effort.
7   Industrial take-up low.                                        3  1  3   Facilitated by Industrial Liaison post.
8   Outreach fails.                                                1  2  2   Appoint Dissemination Officer.
9   Staffing missing/unqualified.                                  1  3  3   Build on existing expertise. Assume likelihood is low if early approval of GridPP2 extension.
10  Organisational problems.                                       1  3  3   Define/build/agree GridPP3 structure. Clarify the role of GridPP3 and its interactions.
11  Technical Risks - See GridPP2 risks: R9, R10, R13, R14, R16, R22, R25, R27, R36. Also, physical risks.   *  *  *   Develop a full GridPP3 Risk Register based on that from GridPP2. Adopt a conservative approach to technology deployment.
12  Inadequate support infrastructure.                             2  2  4   Monitor performance of support activities via pre-defined metrics.
13  Lack of interoperability.                                      2  2  4   Active engagement in NGS, GGF, WLCG, EGEE.
14  Security compromise.                                           3  3  9   Work with other e-Infrastructure providers. Limit capability through portals. Key part of user training.
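The risk column in the table is simply likelihood times impact, so the register can be scored and ranked mechanically. The following sketch does that for the thirteen scored risks (risk 11 carries no numeric scores and is left out). The tuple layout and function names are assumptions; the likelihood and impact values are copied from the table.

```python
# (number, name, likelihood 1-4, impact in {1, 2, 3, 5}) copied from the risk table.
RISKS = [
    (1, "Insufficient funding", 3, 5),
    (2, "Hardware costing", 2, 3),
    (3, "Tier-2 market fails", 2, 3),
    (4, "Tier-1 fails to meet service level", 3, 2),
    (5, "Tier-2s fail to meet service level", 3, 2),
    (6, "Middleware fails", 2, 3),
    (7, "Industrial take-up low", 3, 1),
    (8, "Outreach fails", 1, 2),
    (9, "Staffing missing/unqualified", 1, 3),
    (10, "Organisational problems", 1, 3),
    (12, "Inadequate support infrastructure", 2, 2),
    (13, "Lack of interoperability", 2, 2),
    (14, "Security compromise", 3, 3),
]


def risk_score(likelihood, impact):
    """Risk = likelihood x impact, as in the table's 'Risk (L x I)' column."""
    return likelihood * impact


def rank_risks(risks):
    """Return (score, number, name) tuples, highest score first, ties by risk number."""
    scored = [(risk_score(l, i), number, name) for number, name, l, i in risks]
    return sorted(scored, key=lambda row: (-row[0], row[1]))


ranked = rank_risks(RISKS)
for score, number, name in ranked[:3]:
    print(f"Risk {number} ({name}): {score}")
```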
GridPP2 risk register. Each entry: ID, Name, then one or more (Li Im Risk) triples.
R1 Recruitment/retention difficulties: 2 2 4 / 2 2 4 / 2 2 4 / 2 2 4
R2 Sudden loss of key staff: 1 3 3 / 1 3 3 / 1 3 3 / 1 4 4
R3 Minimal Contingency: 2 2 4 / 1 2 2
R4 GridPP deliverables late: 1 3 3 / 2 3 6 / 3 2 6
R5 Sub-components not delivered to project: 1 2 2 / 2 3 6 / 2 3 6 / 2 3 6
R6 Non take-up of project results: 2 1 2 / 1 4 4 / 2 2 4 / 1 4 4
R7 Change in project scope: 1 1 1 / 2 2 4
R8 Bad publicity: 1 3 3 / 1 3 3 / 1 3 3 / 2 3 6
R9 External OS dependence: 3 1 3
R10 External middleware dependence: 4 2 8 / 1 4 4 / 3 3 9 / 2 2 4
R11 Lack of monitoring of staff: 1 2 2 / 2 2 4 / 2 2 4 / 2 2 4
R12 Withdrawal of an experiment: 2 3 6 / 1 4 4
R13 Lack of cooperation between Tier centres: 2 2 4 / 1 4 4
R14 Scalability problems: 1 2 2 / 2 2 4
R15 Software maintainability problems: 2 2 4 / 1 3 3 / 2 2 4 / 1 3 3
R16 Technology shifts: 1 2 2 / 2 3 6 / 2 3 6
R17 Repetition of research: 3 2 6
R18 Lack of funding to meet LCG PH-1 goals: 4 1 4
R20 Conflicting software requirements: 3 2 6 / 1 3 3
R22 Hardware resources inadequate: 3 3 9 / 4 3 12 / 3 3 9
R25 Hardware procurement problems: 3 2 6 / 2 3 6
R26 LAN Bottlenecks: 1 3 3
R27 Tier-2 organisation fails: 2 2 4
R28 Experiment Requirements not met: 1 4 4
R29 SYSMAN effort inadequate: 2 3 6
R30 Firewalls interfere with Grid: 2 3 6
R31 Inability to establish trust relationships
R32 Security inadequate to operate Grid: 3 3 9
R33 Interoperability: 2 3 6
R35 Failure of international cooperation: 2 1 2
R36 e-Science and GridPP divergence: 1 3 3
R37 Institutes do not embrace Grid: 2 2 4
R38 Grid does not work as required: 4 2 8 / 4 2 8
R39 Delay of the LHC: 2 2 4
R40 Lack of future funding: 2 3 6 / 2 3 6 / 3 3 9
R41 Network backbone failure: 0 4 1
R42 Network backbone bottleneck: 2 2 4
R43 Network backbone upgrade delay: 1 4 4
R44 Inadequate User Support: 1 4 4
Alt-i-r
Pro. GridGridPP LCG MSN Apps
Working Allowance and Contingency
Item                 Contingency [£m]   Working Allowance [£m]
Tier-1 Staff                            0.478 (d)
Tier-1 Hardware      1.758 (a)
Tier-2 Staff         0.853 (b)          0.426 (e)
Tier-2 Hardware      1.537 (c)
Management Staff                        0.049
Operations                              0.300
TOTAL                4.148              1.253
(a) 15% of Tier-1 HW (cost uncertainties).
(b) 4 FTE at Tier-2 (market approach).
(c) 15% of Tier-2 HW (cost uncertainties) + 15% (market approach).
(d) 2 FTE at Tier-1 (service level).
(e) 2 FTE at Tier-2 (service level).
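The two hardware contingency figures can be reproduced from notes (a) and (c) using the hardware totals quoted in the Total Costs table (11723 k£ for Tier-1, 5122 k£ for Tier-2). The sketch below is a cross-check under that assumption. The constant and function names are hypothetical.

```python
# Hardware totals in k£, taken from the Total Costs table.
TIER1_HARDWARE_KGBP = 11723
TIER2_HARDWARE_KGBP = 5122


def contingency_million(hardware_kgbp, fraction):
    """Contingency in £m, as a fraction of a hardware cost given in k£."""
    return round(hardware_kgbp * fraction / 1000.0, 3)


# (a) 15% of Tier-1 hardware for cost uncertainties.
tier1_hw_contingency = contingency_million(TIER1_HARDWARE_KGBP, 0.15)

# (c) 15% of Tier-2 hardware for cost uncertainties plus 15% for the market approach.
tier2_hw_contingency = contingency_million(TIER2_HARDWARE_KGBP, 0.15 + 0.15)

print(f"Tier-1 hardware contingency: £{tier1_hw_contingency:.3f}m")
print(f"Tier-2 hardware contingency: £{tier2_hw_contingency:.3f}m")
```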
Total Project Cost
Work Package                  TOTAL [£m]
A  Tier-1 Staff                4.99
   Tier-1 Hardware            11.72
B  Tier-2 Staff                3.29
   Tier-2 Hardware             5.12
C  Grid Support Staff          4.50
D  Grid Operations Staff       1.89
E  Management                  1.17
F  Outreach                    0.37
G  Travel and Other            0.84
Project Sub-Total             33.90
Working Allowance (4%)         1.25
Project Cost                  35.15
Contingency (12%)              4.15
Tier-1 Running Costs           2.50
Full Approval Cost            41.80
Budget shares: Tier-1 50%, Tier-2 25%, Support 13%, Operations 6%, Management 3%, Travel+Other 2%, Outreach 1%.
Responses to Referee Questions
Exclusivity?
“There is clearly a compelling advantage for the physicists concerned to be aligned with and pool resources with the rest of the global alliance that comprises LCG. However, this does not need to be an exclusive alliance.”
“long-term operational costs, quality of service and interdisciplinary collaboration could surely be improved by a much more integrated and synergistic approach.”
• GridPP has engaged with wider community (and has reported this to PPARC through RCUK annual reports)
• GridPP’s first Grid application was GEANT-based for LISA
• Community is however focussed on its scientific priorities: LHC start-up timescale provides the primary focus
Outsourcing?
“companies are developing expertise in service hosting and provision with many opportunities to develop experts, teams, resource management systems and operational/business knowledge.”
• GridPP has engaged with BT (visits to hosting site in St Albans, meeting with BT management at IC) and discussed possibilities fully in the past.
• Recent IT outsourcing exercises at Bristol and Cambridge indicate that costs are prohibitive (but that these may be offset by a joint PR programme).
Novel? Original? Timely?
“novelty is entirely inappropriate when the goal is a highly reliable, ubiquitous and always available e-Infrastructure”
“similar undertakings of various scales are underway in many countries”
• GridPP notes that many of the methods used have not been tested at the required scale
“The LHC is likely to start producing data by 2007 and the proposed e-Infrastructure must be ready by that date if UK PP is to benefit from that data.”
Relationships?
“the PP grid community has not yet engaged in collaboration on standardising data replication, data location, caching and reliable data movement services.”
• Globus RLS was based on earlier collaboration with EDG, inc. GridPP input
• GridPP plans to include higher level replication services, built on current expertise
Reliable methods?
“In house development of middleware and tools is almost certainly inappropriate”
• GridPP agrees and, hence, the focus is on support and maintenance of existing components, with planned reductions in manpower
• Appendix A2 Middleware Support Planning Document expands upon the identified components as either “mission critical” to UK exploitation or as part of the UK’s input in the wider international context or it is possible to demonstrate leverage
Industrial relevance?
• “significant technology transfer depends on long-term and sustained collaboration where mutual understanding develops and co-adaptation follows”
• GridPP agrees: we are proposing a dedicated 0.5FTE in this area and believe this will represent good value at this level
Viability?
“There is a significant risk that the gLite stack will prove incapable of development for large scale, wide-spread and production QoS use. It is already very complex..”
• GridPP agrees that there is a risk, but the expanded use of gLite across an ever-increasing infrastructure indicates that these problems are being overcome
“It is better than it was but it by no means free from risk and misdirection.”
Planning?
“The proposal states that “A future EGI project, including particle physics as one of the leading applications, may have started”. There are other future scenarios. One is the model already used in GÉANT..”
• GridPP agrees that e.g. UKERNA could have been asked to “manage the Grid”, but this is not currently planned
• Our intention is to (continue to) engage fully with the NGS and other bodies as discussed in appendix A7 National Context Document
Planning?
“I would strongly recommend that a production e-Infrastructure project should not use bespoke software.”
• GridPP agrees – the reference was to experiment-specific code that is currently necessary to fill gaps in the middleware
“It is essential to separate all forms of maintenance, especially bug fixing and “improvements” from operations and to conduct it in a software engineering environment with strict quality controls, testing and release procedures.”
• GridPP agrees – the quality controls, testing and release procedures are of a high standard
Planning?
“It is clear that a production service team should draw on others who should develop such services, not develop them themselves.” …
“It is probably necessary to carry on some aspects of the above work, but these require very careful selection and they should be collaborative with other disciplines and grid projects, and include strategies where the development and maintenance is eventually handed over to others.”
• GridPP agrees – in the GridPP3 proposal we discuss a very limited subset of maintenance and support developments that were proven to be necessary (and were effective) in the past or can be envisaged to be required in future
c.f. “Storage management is an area where there is already good international collaboration led by the PP community on standards and implementations using the SRM specifications”
Past effectiveness?
“The previous two GridPP projects have taken on demanding and challenging engineering, management and R&D tasks. They have been exceptionally successful, as establishing and running grid services on this scale requires world-leading innovation. This has required professional leadership and adept collaboration. There is plenty of evidence of their ability and the advent of LHC data will guarantee their motivation. Their particular strengths are in service management, deployment and operation on a global scale.”
• GridPP agrees
Suitability
“The two previous GridPP projects have demonstrated that they are capable of recruiting, sustaining and managing such a multi-site team. There is likely to be a substantial carry forward of the GridPP2 team. Can you quantify the level of continuity that the project depends on and the assessment of the risk that this continuity will not be met?”
• GridPP agrees – there is a significant risk that the current expertise will be lost due to planning uncertainty. This was addressed in the proposal by the request for early approval of the GridPP2 continuation component.
Reduce number of Tier-2 sites?
“It might be helpful to review carefully whether long-term savings can be made by concentrating Tier-2 resources over fewer sites. Currently table 10 shows 17 sites for Tier-2 resources. Is there really a case for resources at each of these sites?”
• All institutes have delivered on their past MoU commitments (past performance was factored into the proposed sharing of Tier-2 resources)
• If PPARC chose to invest at a small subset of sites, then significant long-term buildings and infrastructure investment would be required (that has not been planned)
• In addition the utility costs of these would be exposed (currently hidden)
• If PPARC chose to select a larger subset of sites, there would be limited gains
“Possibly leveraging SRIF funding is a consideration.”
Cost-effectiveness
““matching funding” is not a justification” (for 7-month GridPP2 continuation in the context of EGEE-II)
• The main case is built upon GridPP2 completing its mission to establish a Production Grid, prior to LHC data-taking mode
• This enables retention of key staff whilst planning for the Exploitation phase in GridPP3
Code efficiency improvements?
“How do you trade between investing in software engineering to improve code performance against investing in more CPU?”
• LHC experiment codes are already highly optimised for the complex data analysis required
• There is significant investment in the optimisation effort within the experiments and the requirements take into account future optimisations
• The optimisations take account of the (distributed) Grid computing constraints
Usage increases?
““use by a much larger community intent on individual analyses” requires further justification. How do you demonstrate this community will actually engage and actually generate this additional use?”
• The experiment requirements anticipate increasing analysis across the experiments
• This is quantified by experiment in the proposal appendices
2. “ALICE Computing Technical Design Report”, lhcc-2005-018.pdf, 114pp.
3. “ATLAS Computing Technical Design Report”, lhcc-2005-022.pdf, 248pp.
4. “CMS: The Computing Project Technical Design Report”, lhcc-2005-023.pdf, 169pp.
5. “LHCb Computing Technical Design Report”, lhcc-2005-019.pdf, 117pp.
Data Management?
“Companies such as Oracle and IBM supply well-honed distributed database technologies capable of high volume and high throughput. Developing PP-specific and home grown solutions is very unlikely to be cost effective.”
• Oracle are fully incorporated into LCG planning, with (low cost) Worldwide Oracle database services used for core metadata functions
Tier-2 additional support?
“Table 12 appears to identify an anomaly that suggests that the plan is not as cost effective as it should be.”
• Tier-2 support effort is currently cross-subsidised through:
1. the PP rolling grant programme;
2. Institute (e.g. computing service) support
• Component 1 was anticipated not to be viable
• Component 2 was modest, but is expected to continue at ~this level
• We have requested Contingency to cover the possibility that component 2 is not preserved (15% on the hardware cost in addition to another 15% that covers the future price uncertainty; plus an additional 4 FTE - 1 at each Tier-2)
• We have also requested Working Allowance of an additional 2 FTE at Tier-2s to be used if the service level falls short
Context Planning?
“The development of this interdependency and cooperation should be explicitly planned and specified from the start of GridPP3.” e.g.
““forms part of the National e-Infrastructure” – what part?”
““CA” LCG uses one system…”
““training” What source of training is this?”
• All plans are integrated with NGS and EGEE in these areas and expanded upon in appendix A7 National Context Document
Overall Scientific Assessment
“This proposal is fundable and should be funded. Because of its significance to an extensive research community a decision to proceed should be made quickly.”
• GridPP agrees
• The outline answers to the referees’ questions are provided in anticipation of such a PPRP decision
Referee 2
• Proposal Details: Reference number: PP/E00296X/1, Grant panel: Projects peer review panel, Grant type: Standard.
• The Proposal: Science quality: I really cannot comment on the pure science, not being a particle physicist. The proposal itself deals with deploying and operating a production GridPP, and as such is mostly infrastructural engineering and computer science of a software engineering flavour, rather than pure research. This is as it should be for a proposal of this type.
• In this sense the proposal is of a high quality. It is of course worthwhile in that it will be impossible for the UK particle physics community to fully engage with the LHC without GridPP3.
• Objectives: The grand objectives are clear enough in the executive summary, the more detailed objectives are distributed throughout the proposal, and perhaps could benefit from a summary tabulation. The objectives are sound but ambitious to an extent that perhaps threatens availability.
• Management: Based on GridPP2, appears to work well.
• Program Plan: Timescales & milestones hard to find.
• Significance: This is a very significant infrastructure for the future of particle physics in the UK.
• c/f Other Work: GridPP has performed very well in the EU context, and also in experimental transatlantic work, and is a central partner in EGEE. The proposed infrastructure is a part of an overall global grid required for LHC.
• Methodology: A continuation and expansion from GridPP2, and likely to be successful if the manpower resources are adequate to the task.
• Industry: Limited proposals.
• Planning: The related planning documents exhibit a good degree of coherency.
• Past Record: The past performance has been good to excellent.
• Suitability: Very suitable.
Project Plan?
“Timescales & milestones hard to find.”
• The intention is to use the project management methods used (successfully) in GridPP1 and GridPP2
• The approach taken to GridPP3 is different to that of GridPP1(2) planning
• A set of high-level deliverables can be prepared in the light of PPRP feedback, if requested
Backup Slides
GridPP2 ProjectMap
[Figure: GridPP2 ProjectMap, status date 31/Dec/05. A grid of numbered milestone and metric boxes grouped under the areas LCG, M/S/N, LHC Apps, Non-LHC Apps, Management and External. Sub-areas shown include Design, Service Challenges, Development, LHC Deployment, Metadata, Storage, Workload, Security, InfoMon, Network, ATLAS, GANGA, LHCb, CMS, BaBar, SamGrid, PhenoGrid, UKQCD, Portal, Project Planning, Project Execution, Production Grid Milestones, Production Grid Metrics, Dissemination, Interoperability, Engagement and Knowledge Transfer.]
Legend: Monitor OK; Monitor not OK; Milestone complete; Milestone overdue; Milestone due soon (60 days); Milestone not due soon; Item not active.
Summary: Metric OK 88 (91%); Metric not OK 9; Tasks Complete 103 (40%); Tasks Overdue 7; Tasks due in next 60 days 16; Items Inactive 20; Tasks not Due 132; Change Forms 37.
Convergence with NGS
6.2 Interoperability
Status Date: 30-Jun-06
Owner: Neil Geddes
Number Due Status
6.2.1 01-Oct-06 Complete
6.2.2 01-Jan-06 In Progress
6.2.3 01-Jun-05 Complete
6.2.4 01-Jan-05 Complete
6.2.5 01-Apr-06 In Progress
6.2.6 01-Apr-05 In Progress
6.2.7 01-Apr-06 Not Started
6.2.8 31-Aug-07 Not Started
6.2.9 On going OK
6.2.10 On going OK
6.2.11 On going OK
6.2.12 On going OK
6.2.13 On going OK
6.2.14 01-Nov-05 Complete
6.2.15
Title
Joint GridPP/NGS plan for web services deployment
Common GridPP/NGS helpdesk and problem tracking infrastructure
First jointly supported service
Final stage connection of GridPP sites to NGS
Integrated plan for Grid support in the UK beyond 2007
Second stage connection of GridPP sites to NGS
Common security policy
First stage connection of GridPP sites to NGS
Number of NGS representatives on GridPP committees
GridPP attendance at NGS committee meetings
Number of Non-HEP applications tested on GridPP Grid
Number of GridPP members attending GGF meetings
Number of GridPP members in charge of formal GGF Working Groups
Implemented Common Security Policy
- The slow emergence of real web-services solutions means that 6.2.2 will probably not be completed during GridPP2.
- GridPP is committed to gLite and NGS intends to be compatible with this but cannot deploy the full gLite stack.
- The GridPP collaboration is discussing formal affiliation with NGS; presently Edinburgh are NGS affiliates and Oxford, RAL, Manchester, and Lancaster are partners. Discussions are underway with Glasgow, UCL, and IC.
In the Beginning…
The UK Grid for HEP really started to grow in 2000 with the release of the Hoffman report into LHC computing requirements and the results of the UK Government Spending Review (SR2000) which targeted £80m for e-Science.
£80m Collaborative projects
Generic Challenges EPSRC (£15m), DTI (£15m)
Industrial Collaboration (£40m)
Academic Application Support Programme
Research Councils (£74m), DTI (£5m)
PPARC (£26m), BBSRC (£8m), MRC (£8m), NERC (£7m), ESRC (£3m), EPSRC (£17m), CLRC (£5m)
Hardware Costs
[Figure: Estimated Price per KSI2K, 2002-2012. Left axis: Price/KSI2K (£K); right axis: LN(Price/MSI2K). Annotation: Moore’s Law for CPU cost.]
[Figure: Storage Cost, 2002-2012. Left axis: Price/TB (£K); right axis: LN(Price/PB). Annotation: Kryder’s Law for disk cost.]
Hardware costs extrapolated from recent purchases. However, experience tells us there are fluctuations associated with technology steps. Significant uncertainty in integrated cost.
Model must factor in:
- Operational life of equipment
- Known operational overheads
- Lead time for delivery and deployment.
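The extrapolation described above can be sketched as a straight-line fit to the logarithm of the unit price, which is what the LN(Price) curves on the plots suggest. This is an illustrative sketch only: the function names and the sample prices are hypothetical placeholders, not GridPP purchase data, and it ignores the technology-step fluctuations, equipment lifetime and lead times that the slide says a real model must include.

```python
import math

def fit_log_linear(years, prices):
    """Least-squares fit of ln(price) = a + b * year; returns (a, b)."""
    n = len(years)
    logs = [math.log(p) for p in prices]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def extrapolate(a, b, year):
    """Price predicted by the fitted exponential trend for a given year."""
    return math.exp(a + b * year)

# HYPOTHETICAL unit prices that halve every two years (not real data).
years = [2002, 2004, 2006]
prices = [1.6, 0.8, 0.4]

a, b = fit_log_linear(years, prices)
halving_time = -math.log(2) / b        # years for the price to halve
price_2008 = extrapolate(a, b, 2008)   # extrapolated unit price
```

With these placeholder inputs the fitted halving time is two years by construction; real purchase prices would scatter around the line, which is the source of the uncertainty in the integrated cost.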
Hardware Costs: Tape
2007 2008 2009 2010 2011 2012
CAPACITY MODEL
Required capacity 816 2538 4808 7682 9753 12085
Actual CASTOR Capacity 544 2538 4808 10516 10129 12085

9940 Media
Existing 9940 Slot Count 1948 0 0 0 0 0
Media Capacity (9940) 0.182 0.182 0.182 0.182 0.182 0.182
Existing 9940 Capacity 324 0.000 0.000 0.000 0.000 0.000

T10/20K Media
Total Required Tape Capacity April (TB) 816 2538 4808 7682 9753 12085
Tapes phased out in March 0 0 0 0 430 778
Total Tapes Available in March 430 1208 5639 10684 11254 10476
Total Storage Capacity (March) 194 544 2538 9616 10129 9429
Additional TB Required for April 350 1994 2270 0 0 2656
Additional Tapes Purchased 778 4432 5045 1000 0 2951
Used Slots April (T10/20K) 1208 5639 10684 11684 11254 13428
T10/20K Media Cost 0.08 0.07 0.06 0.06 0.06 0.06
Media Capacity 0.45 0.45 0.45 0.9 0.9 0.9
Spent on Media 62 310 303 60 0 177

Spent on new Robot Infrastructure 250 50 50
New Slots Purchased 6000 2000 2000
Maximum Slot Count Available 5000 11000 11000 13000 13000 15000
Total Used Slots 3156 5639 10684 11684 11254 13428

BANDWIDTH MODEL
Estimated rate to Fill (6 months) 32 114 151 191 137 155
In beam Double Fill Rate 228 301 381 275 309
In beam Media Conversion (6 months) 17 319
In beam reprocessing 114 151 191 137 155
Out of beam Reprocessing Read Rate (4 months?) 252 478 764 970 1202
Drive deadtime on writes 25% 25% 25% 25% 25% 25%
Drive deadtime on reads 25% 25% 25% 25% 25% 25%

In beam write capacity required 327 401 933 366 412
Out beam write capacity required 0 0 0 0 0
In beam read capacity required 174 201 679 183 206
Out beam read capacity required 337 638 1019 1294 1603

In beam total required bandwidth 501 602 1613 549 619
Out beam total required bandwidth 337 638 1019 1294 1603
Total available CASTOR bandwidth 555 640 720 1680 1320 1680

9940B Drives 6 3 0 0 0 0
9940B Maintenance Cost/drive 3.0 3.3 3.6 4.0 4.4 4.8
Spent on 9940B Maintenance 18 9.9 0 0 0 0
Bandwidth per brick (MB/s) 25 25 25 25 25 25
9940B Bandwidth 150 75 0 0 0 0

Cost of Storage Brick (T10) 19.15 19.15 19.15 19.15 19.15 19.15
T10K Maintenance Cost/drive 2.3 2.3 2.3 2.3 2.3 2.3
New T10K Server Bricks 3 2 1 0
Total T10K Server Bricks 6 8 9 9 0 0
Bandwidth per brick (MB/s) 80 80 80 80 80 80
Spent on Server bricks 57.45 38.3 19.15 0 0 0
Spent on T10K Maintenance 6.9 13.8 18.4 20.7
Total T10K Bandwidth 480 640 720 720 0 0

Cost of Storage Brick (T20) 19.15 19.15 19.15
T20K Maintenance Cost/drive 2.3 2.3 2.3
New T20K Server Bricks 8 3 3
Total T20K Server Bricks 8 11 14
Bandwidth per brick (MB/s) 120 120 120
Spent on Server bricks 0 0 0 153.2 57.45 57.45
Spent on T20K Maintenance 0 0 0 0 18.4 25.3
Total T20K Bandwidth 0 0 0 960 1320 1680

Spent on ADS Maintenance 10 10 0 0 0 0
Spent on Minor Parts 10 10 10 10 10 10
Spent on Robot 1 M&O 30 30 30 30 30 30
Spent on Robot 2 M&O 50 50 55 55 60

Summary
Spent on Media 62 310 303 60 0 177
Spent on Bandwidth and Operation 132 412 128 319 189 258
Spent Total 195 722 430 379 189 435
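Two of the derived rows in the tape model can be reproduced from the rows above them. The sketch below assumes, as an inference from the numbers rather than anything stated on the slide, that "Spent on Media" is tapes purchased multiplied by the per-tape media cost, and that "Total T10K Bandwidth" is the number of T10K server bricks multiplied by the bandwidth per brick. Variable names are illustrative.

```python
# Columns are the years 2007..2012, copied from the tape model table.
years = [2007, 2008, 2009, 2010, 2011, 2012]

additional_tapes_purchased = [778, 4432, 5045, 1000, 0, 2951]
media_cost_per_tape = [0.08, 0.07, 0.06, 0.06, 0.06, 0.06]

# ASSUMPTION: media spend = tapes purchased x unit media cost, rounded
# to the nearest whole unit as in the table.
spent_on_media = [
    round(n * c) for n, c in zip(additional_tapes_purchased, media_cost_per_tape)
]

total_t10k_bricks = [6, 8, 9, 9, 0, 0]
t10k_mb_per_s_per_brick = 80

# ASSUMPTION: aggregate bandwidth = number of bricks x bandwidth per brick.
total_t10k_bandwidth = [n * t10k_mb_per_s_per_brick for n in total_t10k_bricks]

for year, spend, bw in zip(years, spent_on_media, total_t10k_bandwidth):
    print(year, spend, bw)
```

Both computed rows agree with the "Spent on Media" and "Total T10K Bandwidth" rows of the table, which supports reading the model as a simple per-unit cost and per-brick bandwidth calculation.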
Running Costs
(Work in progress)
CPU                2007     2008      2009      2010
New Systems        166      761       404       473
New Racks          5        24        13        15
Phased out racks   4        3         5         0
Rack Count         18       39        47        61
KW/New System      0.26     0.26      0.27      0.29
New KW                      198       110       136
Phased Out KW               18        51        0
Total Load (KW)    151      330       390       525
Cost Per KW                 0.00008   0.00008   0.00009
Cost               £0k      £347k     £430k     £609k

Disk               2007     2008      2009      2010
New Systems        101      201       82        134
New Racks          14       29        12        19
Phased out racks   3        4         0         10
Rack Count         32       57        69        78
KW/New System      0.735    0.77      0.81      0.85
New KW                      155       66        114
Phased Out KW               14        0         49
Total Load (KW)    116      257       323       388
Cost Per KW                 0.00008   0.00008   0.00009
Cost               £0k      £270k     £357k     £450k
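The load figures in the running-cost table appear to follow a simple year-on-year recurrence: the previous year's load plus the power drawn by new systems minus the power of phased-out systems. The sketch below is an inference from the numbers (reading the disk block by analogy with the labelled CPU rows), not a formula given on the slide, and `project_load` is a hypothetical helper name.

```python
def project_load(initial_kw, new_kw, phased_out_kw):
    """Return yearly total load, starting from initial_kw.

    Each later year is the previous load plus new KW minus phased-out KW.
    """
    loads = [initial_kw]
    for added, removed in zip(new_kw, phased_out_kw):
        loads.append(loads[-1] + added - removed)
    return loads

# Disk figures for 2007 (initial) and 2008-2010 (new / phased-out KW).
disk_load = project_load(116, [155, 66, 114], [14, 0, 49])

# CPU figures for the same years.
cpu_load = project_load(151, [198, 110, 136], [18, 51, 0])

print(disk_load)
print(cpu_load)
```

The disk series reproduces the tabulated 116, 257, 323 and 388 KW exactly. The CPU series comes out within 1 KW of the tabulated 151, 330, 390 and 525 KW, which is consistent with the table having been computed from unrounded inputs.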
Tier-1 Growth
                         Now      Start of GridPP3   End of GridPP3
Spinning Disks           ~2000    ~10,000            ~20,000
Yearly disk failures     30-45    200-300?           400-600?
CPU Systems              ~550     ~1800              ~2700
Yearly system failures   35-40    120-130?           180-200?
To achieve the levels of service specified in the MOU, a multi-skilled incident response unit (3 FTE) is proposed. This is intended to reduce the risk of over-provisioning other work areas to cope with long term fluctuations in fault rate. These staff will have an expectation that their primary daily role will be dealing with what has gone wrong. They will also provide the backbone of the primary callout team.
Tier-2 Allocations
• Take each experiment’s CPU and Disk requirements (from Dave Newbold)
• For each experiment – share out among Tier-2s
• For each Tier-2 – share out among institutes
• Sum over experiments (maintains the correct CPU/Disk ratio)
Sharing guided by:
• Size of local community (number of Ac/Ph/PP)
• Past delivery (KSI2K to date, Disk usage last quarter)
• Current resources available
Tier-2 ‘Shares’
            Physicists (LHC only)   Existing Resources 1Q06   Delivery to date    Disk used 1Q06   Summary
Tier-2      FTEs    %               KSI2K     TB      %       KSI2K Hrs    %      TB      %        Min    Max    Ave
London      40      26%             1049.0    37.7    27%     1,348,236    39%    17.9    21%      21%    39%    28%
NorthGrid   33      22%             1783.1    132.2   48%     1,229,271    36%    34.2    40%      22%    48%    36%
ScotGrid    14      9%              354.0     44.6    10%     187,443      5%     21.0    24%      5%     24%    12%
SouthGrid   66      43%             516.4     48.4    15%     661,080      19%    13.4    15%      15%    43%    23%
Total       152                     3702.5    262.9           3,426,030           86.6

~35% ~35% ~10% ~20%
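The Min, Max and Ave summary columns are consistent with taking the minimum, maximum and unweighted mean of the four percentage indicators for each Tier-2 (physicist FTEs, existing resources, delivery to date, disk used). That reading is an inference from the numbers, not something the slide states, and the names below are illustrative.

```python
# Percentage indicators per Tier-2, in the order:
# physicists, existing resources, delivery to date, disk used.
indicators = {
    "London":    [26, 27, 39, 21],
    "NorthGrid": [22, 48, 36, 40],
    "ScotGrid":  [9, 10, 5, 24],
    "SouthGrid": [43, 15, 19, 15],
}

# ASSUMPTION: the summary is a plain min / max / unweighted mean.
summary = {
    name: {
        "min": min(values),
        "max": max(values),
        "ave": sum(values) / len(values),
    }
    for name, values in indicators.items()
}

for name, s in summary.items():
    print(f"{name}: min {s['min']}%, max {s['max']}%, ave {s['ave']:.2f}%")
```

The minima and maxima match the table exactly, and the means (28.25, 36.5, 12 and 23) agree with the tabulated averages to the nearest percent.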
Example
CMS Requirement in 2008 is 1800 KSI2K and 400 TB

Tier-2 sharing matrix (PMB/Tier-2 Board):
            ATLAS   CMS    LHCb   Other
London      0.25    0.75   0.10   0.30
NorthGrid   0.50    0.00   0.20   0.40
ScotGrid    0.15    0.00   0.30   0.10
SouthGrid   0.10    0.25   0.40   0.20

Institute sharing matrix (Tier-2 Board):
            ATLAS   CMS    LHCb   Other
Brunel      0.00    0.10   0.00   0.15
Imperial    0.00    0.90   1.00   0.00
QMUL        0.70    0.00   0.00   0.60
RHUL        0.20    0.00   0.00   0.15
UCL         0.10    0.00   0.00   0.10

i.e. Imperial ‘allocation’ is 1800 KSI2K (400 TB) x 0.75 x 0.9 = 1215 KSI2K (270 TB)
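The worked Imperial example can be written as a two-step product: the experiment requirement times the Tier-2 fraction times the institute fraction. The sketch below uses only the CMS column of the example matrices; the function and variable names are illustrative, and a full allocation would repeat this for every experiment and sum the results, as the allocation procedure describes.

```python
# CMS requirement for 2008, as quoted in the example.
cms_2008 = {"cpu_ksi2k": 1800.0, "disk_tb": 400.0}

# CMS column of the Tier-2 sharing matrix.
tier2_share_cms = {
    "London": 0.75, "NorthGrid": 0.00, "ScotGrid": 0.00, "SouthGrid": 0.25,
}

# CMS column of the institute sharing matrix (London Tier-2 institutes).
london_institute_share_cms = {
    "Brunel": 0.10, "Imperial": 0.90, "QMUL": 0.00, "RHUL": 0.00, "UCL": 0.00,
}

def institute_allocation(requirement, tier2_fraction, institute_fraction):
    """Scale each resource by the Tier-2 share and then the institute share."""
    return {
        resource: amount * tier2_fraction * institute_fraction
        for resource, amount in requirement.items()
    }

imperial_cms = institute_allocation(
    cms_2008,
    tier2_share_cms["London"],
    london_institute_share_cms["Imperial"],
)
print(imperial_cms)
```

Because CPU and disk are scaled by the same fractions, each institute inherits the experiment's own CPU/Disk ratio, which is the property the allocation procedure is designed to preserve.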
Crosscheck: Allocated CPU v 'Size'
[Figure: scatter plot of Allocated CPU (y-axis, 0% to 25%) against Size/Delivery/Current (x-axis, 0% to 20%); labelled points: Imperial, Bristol, Glasgow, Liverpool.]
Tier-2 Staff
Institute    FTE     FTE %
Brunel       0.50    3%
Imperial     1.50    10%
QMUL         1.00    7%
RHUL         0.50    3%
UCL          1.00    7%
Lancaster    1.50    10%
Liverpool    1.00    7%
Manchester   1.50    10%
Sheffield    1.00    7%
Durham       0.25    2%
Edinburgh    0.50    3%
Glasgow      1.00    7%
Birmingham   1.00    7%
Bristol      1.00    7%
Cambridge    0.50    3%
Oxford       0.50    3%
RAL PPD      0.50    3%
Total        14.75   100%

[Figure: Allocated FTE v CPU. Scatter plot of Allocated FTE (y-axis, 0% to 12%) against Allocated CPU (x-axis, 0% to 25%); labelled points: Manchester, Birmingham and UCL, Sheffield, Imperial.]
Proposal Procedure
Tier-1 £Am; Tier-2 £Bm; Middleware £Cm; Applications £Dm; Management £Em; …; Total £Xm
Proposal
Tier-1 £am; Tier-2 £bm; Middleware £cm; Applications £dm; Management £em; …; Total £Ym
Re-evaluation
Institute 1 £fm; Institute 2 £gm; Institute 3 £hm; Institute 4 £im; Institute 5 £jm; …; Total £Ym
Apply for Grants
Peer Review
GridPP1/GridPP2
GridPP3
Peer Review
Tier-1 £Am; Tier-2 £Bm; Middleware £Cm; Applications £Dm; Management £Em; …; Total £Xm
Proposal
Institute 1 £Fm; Institute 2 £Gm; Institute 3 £Hm; Institute 4 £Im; Institute 5 £Jm; …; Total £Xm
Allocate
Institute 1 £fm; Institute 2 £gm; Institute 3 £hm; Institute 4 £im; Institute 5 £jm; …; Total £Ym
Is this still a sensible project?
GridPP3 Deployment Board
In GridPP2, the Deployment Board is squeezed into a space already occupied by the Tier-2 Board; the D-TEAM; and the PMB. Many meetings have been “joint” with one of these other bodies. Identity and function have become blurred.
T1B Chair T2B Chair Prdn Mgr.
Deployment Board
Technical Coordinator
Tier-1 Board Tier-2 Board D-Team Grp-1 Grp-2 … Grp-n
Project Management Board
In GridPP3, propose a combined Tier-2 Board and Deployment Board with overall responsibility for deployment strategy to meet the needs of the experiments. In particular, this is a forum where providers and users formally meet. Deals with:
1) Issues raised by the Production Manager which require strategic input.
2) Issues raised by users concerning the service provision.
3) Issues to do with Tier-1 - Tier-2 relationships.
4) Issues to do with Tier-2 allocations, service levels, performance.
5) Issues to do with collaboration with Grid Ireland and NGS.
GridPP3 DB Membership
1) Chair
2) Production Manager
3) Technical Coordinator
4) Four Tier-2 Management Board chairs
5) Tier-1 Board Chair
6) ATLAS, CMS, LHCb representatives
7) User Board Chair
8) Grid Ireland representative
9) NGS representative
10) Technical people invited for specific issues
Above list gives ~13 core members, 5 of whom are probably on PMB. There is a move away from the technical side of the current DB and it becomes a forum where the deployers meet each other and hear directly from the main users. The latter is designed to ensure buy-in by the users to strategic decisions.
Grid Data Management
Operational Support: FTS; metadata catalogues as they are deployed; replica optimisation services eventually.
Maintenance: Metadata services and eventually replica optimisation services.
Development: Common metadata services; Replica optimisation.
Components:
- File transfer services.
- Metadata Catalogues.
- Services to manage the replication of data.
Storage Management
Operational Support: All above components. Hope to reduce number.
Maintenance: GridPP “owns” dCache installation and configuration scripts within LCG, and the SRM2 interface to CASTOR.
Development: None envisaged in GridPP3 era. However, SRM version-3 may impose some requirements.
Components:
- DPM (used at 12 Tier-2 sites in UK)
- dCache (used at Tier-1 and 7 Tier-2 sites in UK)
- CASTOR SRM1 (Tier-1 but to be phased out in 2006)
- CASTOR SRM2 (Tier-1 - primary developer)
Information and Monitoring
Operational Support: R-GMA
Maintenance: R-GMA and SD.
Development: R-GMA may still require development at start of GridPP3. Glue schema likely to require ongoing development (minor effort).
Components:
- R-GMA (information system slated to replace the BDII)
- Service Discovery (SD)
- APEL accounting (uses R-GMA)
- GLUE Schema (information model to define Grid resources)
Workload, Performance and Portal
Operational Support: WMS, Job information repository. Job information analysis.
Maintenance: WMS-testing, Job information scripts, RTM, Portal.
Development: Portal (to address needs of new users); Job information scripts (to enrich/optimise content); (Possibly RTM if evolution still required/desired).
Components:
- WMS (Resource Broker, Logging & Bookkeeping server etc.)
- Tools to gather job information (used by ATLAS, CMS, and the RTM)
- Real Time Monitor (RTM)
- GridPP Portal
Security
Operational Support: GridSite and VOMS. Operational Security Officer Post. International Security Coordination Post.
Maintenance: GridSite
Development: GridSite
Components:
- GridSite Toolkit (includes Grid Access Control Language GACL and GridSite’s Apache extension mod_gridsite, both used by ATLAS and CMS)
- VOMS
Networking
Operational Support: Network monitoring and diagnostics.
Maintenance: Minor.
Development: None.
Components:
- High level contacts with JISC and UKERNA.
- Requirements and provisioning.
- Work with providers in respect to interfaces to Grid – Network operations.
- Network monitoring and diagnostic tools.
Active Users (All VOs)
Active Users by LHC experiment
ALICE (8)
CMS (150)
ATLAS (70)
LHCb (40)
Job success? Overview
Job Success by LHC experiment
ALICE
CMS
ATLAS
LHCb