National contribution to the development of the LCG computing grid for elementary particle physics

- CONDEGRID -

Department of Computational Physics and Information Technologies
'Horia Hulubei' National Institute for R&D in Physics and Nuclear Engineering (IFIN-HH), Magurele, Romania

M. Dulea

ISAB Meeting, IFA, Magurele, 22.10.2013

OVERVIEW

1. EXPERIMENTAL CONTEXT

2. ACHIEVEMENTS: GRID RESOURCES

3. ACHIEVEMENTS: SERVICE RELIABILITY and GRID PRODUCTION

4. ACHIEVEMENTS: SITE OPTIMIZATION

5. ACHIEVEMENTS: ADOPTING NEW TECHNOLOGIES and INITIATIVES

6. ACHIEVEMENTS: TRAINING and DISSEMINATION

7. ACHIEVEMENTS: MANAGEMENT and COORDINATION

8. ADDITIONAL ACTIVITIES

9. PROSPECTS for 2014 and 2015

1 – EXPERIMENTAL CONTEXT

ALICE
- Has recently improved analysis job efficiency.
- Plans a transition from BitTorrent to CVMFS for the distribution of ALICE software on WNs.
- The computing consequences of the planned upgrade will be presented in a Computing TDR in Oct. '14.

Preparing Run-2: Update of the Computing Models of the WLCG and the LHC Experiments (draft version, Sept. 2013)
- Goal: to minimize operational costs.
- New: using opportunistic resources; computing as a service; etc.

ATLAS
- The Run-2 Computing Model improves CPU/event consumption and AOD event sizes of data reconstruction, MC production chain and analysis.
- ATLAS resource management is modified so as to keep resource requirements within reasonable limits.
- Preparation of the ATLAS Data Challenge (June 2014).

LHCb
- All data reprocessed in spring 2013.
- Reduction by a factor of 2 in event size (by improving compression/output format).
- Growing need of disk space; 50 TB/week increase in disk usage; software efforts to reduce disk consumption.
- Decision to allow disk and analysis at some selected Tier2 sites.

1 - OVERALL RESOURCES PLEDGES and REQUIREMENTS

http://gstat-wlcg.cern.ch/apps/pledges/summary/, 05.10.2013

Tier2 pledges & needs =>

ALICE “has historically faced a lack of resources”. Following CRSG’s recommendations, ALICE lowered its required levels, but this induced an even lower offer (T1s).

ATLAS “is using its pledged (and non-pledged) resources in full and consequently believes there is a solid case for an increase in the CRSG resource recommendations (04) for the 2014 RRB year” (ATLAS Comp. Res. Usage March 2013 - September 2013)

Tier-2s “provided more resources than pledged. All resources were fully used, mainly for MC simulation, user analysis (and prerequisite group production), and MC reconstruction”

Concerns: “There remains a potential danger that LHC physics output could be limited by having insufficient computing resources” (LHCC deliberations, 04.2013)

2 - RESOURCE PROVISION

First task assumed by RO-LCG through MoU: to provide disk storage resources and computing power for Monte Carlo simulations and data analysis needed by three LHC experiments: ALICE, ATLAS, LHCb

RO-LCG provides Grid resources and services through 7 computing centres:

Upgrade of resources during Y2 of 8EU:

Computing: ~ 4800 cores => ~ 6000 cores

Storage: ~ 1.8 PetaBytes => ~ 2.2 PetaBytes

1. Resource centres funded from 8EU

RO-07-NIPNE (alice, atlas, lhcb)

RO-13-ISS (alice)

RO-14-ITIM (atlas)

RO-16-UAIC (atlas)

2. Resource centres (of the experimental groups @ IFIN-HH) that are not funded from 8EU

NIHAM (alice)

RO-02-NIPNE (atlas)

RO-11-NIPNE (lhcb)

8EU supports coordination, monitoring, and activity reporting for all RO-LCG grid sites (including experiments' sites).
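As a quick check of the upgrade figures quoted on this slide, the relative growth of the computing and storage resources can be worked out directly. This is only an arithmetic sketch; the input values are the approximate ("~") numbers from the slide, and the helper name `relative_increase` is introduced here for illustration.

```python
# Approximate figures quoted on the slide ("~" in the source).
cores_before, cores_after = 4800, 6000
storage_before_pb, storage_after_pb = 1.8, 2.2


def relative_increase(before: float, after: float) -> float:
    """Return the growth (after - before) / before, expressed in percent."""
    return (after - before) / before * 100.0


cores_growth = relative_increase(cores_before, cores_after)
storage_growth = relative_increase(storage_before_pb, storage_after_pb)

print(f"Computing: +{cores_growth:.1f}%")
print(f"Storage:   +{storage_growth:.1f}%")
```

Under these rounded inputs the computing capacity grows by about a quarter and the storage by a bit more than a fifth.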

2 - FULFILLMENT of the RESOURCE PLEDGE

http://gstat-wlcg.cern.ch/apps/topology/federation/252/, 05.10.2013

R: The pledge target is overachieved

Second MoU task: to fulfill the annual resource pledge

The resource pledge for 2013-14 was established in 2012, in agreement with the experimental groups and with the CRSG requests

The fulfillment of the 2013 pledge was based on the 2012 upgrades

3 - SERVICE RELIABILITY and GRID PRODUCTION

RO-LCG GRID PRODUCTION

Jan.-Oct. 2012: 85.3 million HS06-hours run.

Jan.-Sep. 2013: 110 million HS06-hours run.

The grid production of the sites supported by 8EU represents 60% of the total RO-LCG production
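Because the two production totals above cover intervals of different length (10 months in 2012, 9 months in 2013), a per-month average makes them easier to compare. The sketch below assumes that a simple monthly mean is a fair normalisation, which the slide itself does not state.

```python
# (total in million HS06-hours, number of months), as quoted on the slide.
production = {
    "Jan.-Oct. 2012": (85.3, 10),
    "Jan.-Sep. 2013": (110.0, 9),
}

# Average monthly production for each interval.
monthly = {period: total / months for period, (total, months) in production.items()}

# Year-on-year growth of the monthly average, in percent.
growth = (monthly["Jan.-Sep. 2013"] / monthly["Jan.-Oct. 2012"] - 1.0) * 100.0

for period, rate in monthly.items():
    print(f"{period}: {rate:.2f} million HS06-hours per month")
print(f"Growth of the monthly average: {growth:.1f}%")
```

With these inputs the monthly average rises from roughly 8.5 to roughly 12.2 million HS06-hours.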

Reliability coefficient of RO-07-NIPNE =>

Grid production (in kSI2k-hrs.) run on RO-07-NIPNE, on the other sites supported from the 8EU project, and on the RO-LCG sites that are not supported from the 8EU project =>

Third MoU task: to fulfill the service level agreement (SLA) stipulated in the MoU

Average reliability coefficient of the main site supported by 8EU, RO-07-NIPNE: 95%

Overachievement by other 8EU sites:

RO-13-ISS: 100%, RO-16-UAIC: 98%

3 - GRID PRODUCTION in Jan.-Sep. 2013

R: RO-LCG ranks 11th among the 34 Tier-2 national centers by cumulative ALICE+ATLAS+LHCb CPU hours, with a share of 2.18% of the total.

R: RO-LCG ranks 8th among the 34 Tier-2 national centers by cumulative ALICE+ATLAS+LHCb number of jobs, with a share of 2.76% of the total.

Total CPU hours (kSI2k) and number of jobs recorded for RO-LCG during Sep.-Aug. intervals. The 2012-2013 values reflect the increase of the share of ATLAS analysis and ALICE (longer) jobs (src.: EGI accounting portal) =>

Y2 activity for RO-07-NIPNE (Jan.-Sept. 2013):

alice: production jobs, using CE1 (tbit01)

> 660000 jobs, ~ 2.47 MCPU hours

atlas: production & analysis jobs, using CE2 (tbit03)

~ 2000000 jobs, ~ 6.24 MCPU hours

lhcb: reprocessing jobs, using CE2

> 100000 jobs, ~ 1.45 MCPU hours

The improvement of ATLAS analysis capacity @ RO-07 strongly depends on the reliability of the external connection to the French ATLAS Cloud, and on the efficiency of data handling within the site.
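The per-experiment figures for RO-07-NIPNE above can be turned into a rough average CPU time per job, which illustrates how much longer the LHCb reprocessing jobs are than the ALICE and ATLAS ones. This is a sketch only: it assumes "MCPU hours" means millions of CPU hours and treats the bounds (">") and approximations ("~") on the slide as exact values, so the results are indicative at best.

```python
# (number of jobs, CPU hours) per experiment, Jan.-Sept. 2013, from the slide.
# Job counts given as lower bounds (">") are used as if exact: an assumption.
activity = {
    "alice": (660_000, 2.47e6),
    "atlas": (2_000_000, 6.24e6),
    "lhcb": (100_000, 1.45e6),
}

# Average CPU hours consumed per job for each experiment.
hours_per_job = {vo: cpu_hours / jobs for vo, (jobs, cpu_hours) in activity.items()}

for vo, hours in hours_per_job.items():
    print(f"{vo}: about {hours:.1f} CPU hours per job")
```

Where the job count is only a lower bound, the true average is at most the printed value.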

[Chart: Jobs and CPU hours for IFIN GRID and RO-07; axis from 0 to 35.000.000]

4 - OPTIMIZATION OF RO-07-NIPNE

Left: Topology of RO-07-NIPNE. Right: After optimization, an overall traffic of 11 Gb/s from the storage server to the WNs was reached during the concurrent running of 500 analysis jobs.

The optimum writing speed on the WN’s disk (disk write throughput) can be reached only with the best configuration in both the disk server and the WN. Example: improving the performance of a storage facility with a max. transfer speed of 8 Gbps.

The scalability of the cluster must be ensured by increasing the bandwidth available for data transfer, at a constant rate, whenever the storage capacity is upgraded.

SE-WN bandwidth upgrades through switch cascading (stack configurations)

CD-CC bandwidth upgrades (currently 2x10 Gbps)

SE design: special issues due to: a) 1 Gbps links; b) distribution of hardware in 2 separate locations (CD/CC); c) a required min. DD-WN transfer speed of 2.5 MB/s; d) repeated upgrades of the distributed and heterogeneous storage. A balance must exist between the storage capacity and the network throughput of each data disk server.
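The interplay between link speed and the minimum per-job transfer rate can be illustrated with a short calculation. The sketch below assumes decimal units (1 Gb = 10^9 bits, 1 MB = 10^6 bytes), ignores protocol overhead, and reads "DD-WN" as the transfer from a data disk server to a worker node; all three are assumptions, not statements from the slides.

```python
BITS_PER_BYTE = 8
MIN_RATE_MB_S = 2.5  # required minimum DD-WN transfer speed, from the slide


def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert a link speed in Gb/s to MB/s (decimal units, no overhead)."""
    return gbps * 1e9 / BITS_PER_BYTE / 1e6


# How many concurrent transfers can one 1 Gbps link feed at the minimum rate?
link_mb_s = gbps_to_mb_per_s(1.0)
max_streams = int(link_mb_s // MIN_RATE_MB_S)

# Average rate per job for the measured 11 Gb/s shared by 500 analysis jobs.
per_job_mb_s = gbps_to_mb_per_s(11.0) / 500

print(f"1 Gbps link: {link_mb_s:.0f} MB/s, up to {max_streams} streams at {MIN_RATE_MB_S} MB/s")
print(f"11 Gb/s over 500 jobs: {per_job_mb_s:.2f} MB/s per job")
```

Under these assumptions a single 1 Gbps link saturates at a few tens of concurrent jobs, which is one way to see why the network throughput of each disk server has to grow together with its storage capacity.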

Reaching the bandwidth saturation of the external link @ Magurele

The large data transfer peaks recorded since 2012, mainly due to the grid sites, demonstrated the need for a bandwidth larger than 10 Gbps. This justified the purchase of a BGP router. An upgrade to 40 Gbps is planned.

The transfer capacity of the FO link Magurele - RoEduNet NOC has reached its limits

5 - ADOPTING NEW TECHNOLOGIES AND INITIATIVES

LHCONE (LHC Open Network Environment): a network that is private to the LHC Tier-1/2/3 sites; it provides better access to the datasets for the whole HEP community.

R: Integration of the LCG sites from Magurele into LHCONE (July 2013)
- Grid traffic separated from the common data traffic.
- A separate VLAN was defined on the existing network.
- Beneficiaries: first, the ATLAS sites RO-02, RO-07; next, NIHAM, RO-11, RO-13-ISS.
- At present all RO-LCG sites benefit from it (ATLAS) or will benefit when the experiments decide (ALICE, LHCb).

Direct T2 (T2D) sites: reliable, primary hosts for ATLAS datasets (analysis) and group analysis, get and send data from different clouds, take part in cross cloud production

T2D test: check whether transfer rates with at least 9 T1s are larger than 5 MB/s, for some periods of time (see Fig.). To find weak points, the transfer capacity of the link Magurele - RoEduNet NOC - GEANT was tested with 3 pairs of perfSONAR servers (for bandwidth and latency), located at DFCTI, UPB, and in RoEduNet’s NOC.
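The T2D criterion described above (transfer rates above 5 MB/s with at least 9 T1s) can be expressed as a simple threshold count. The following is a minimal sketch, not the actual ATLAS qualification procedure: the function name, the T1 labels and the sample rates are all hypothetical, and the slide does not say how the rates are averaged over time.

```python
from typing import Mapping


def meets_t2d_criterion(
    rates_mb_s: Mapping[str, float],
    min_rate_mb_s: float = 5.0,
    min_sites: int = 9,
) -> bool:
    """Return True if enough T1s show a transfer rate above the threshold."""
    passing = [site for site, rate in rates_mb_s.items() if rate > min_rate_mb_s]
    return len(passing) >= min_sites


# Hypothetical measured rates (MB/s) towards ten T1 sites; not real data.
sample_rates = {
    f"T1-{i:02d}": rate
    for i, rate in enumerate(
        [6.1, 7.4, 5.2, 9.0, 4.8, 5.5, 8.3, 6.7, 5.9, 5.1], start=1
    )
}

print(meets_t2d_criterion(sample_rates))
```

In this made-up sample nine of the ten sites exceed the threshold, so the check passes; lowering any one of those nine rates below 5 MB/s would make it fail.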

R: RO-07-NIPNE started tests for T2D qualification

The poor performance recorded was an argument for replacing the connection provider.

Preparing for the deployment of IPv6

Two IPv6 address blocks (/48, /64) were acquired; tests were run between IFIN GRID servers and RO-16-UAIC servers. Collaboration with RoEduNet.
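To give a sense of the sizes of the /48 and /64 prefix lengths mentioned above, Python's standard `ipaddress` module can be used. The actual prefixes acquired are not given in the slides, so the sketch below uses the reserved documentation prefix 2001:db8::/48 purely as a stand-in.

```python
import ipaddress

# Stand-in prefix from the IPv6 documentation range (2001:db8::/32);
# the real site prefixes are not stated in the source.
site_block = ipaddress.ip_network("2001:db8::/48")

# Number of /64 subnets that fit into one /48 block.
subnet_count = 2 ** (64 - site_block.prefixlen)

# First /64 subnet carved out of the /48, and its address count.
first_subnet = next(site_block.subnets(new_prefix=64))
addresses_per_subnet = first_subnet.num_addresses

print(f"/64 subnets in a /48: {subnet_count}")
print(f"First /64 subnet: {first_subnet}")
print(f"Addresses per /64: {addresses_per_subnet}")
```

A /48 is thus large enough to give each VLAN or cluster its own /64 many times over, which is a common way to plan such an allocation, although the slides do not describe the addressing plan actually used.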

6 - TRAINING & DISSEMINATION

11.04 Calcul Atlas France (CAF) Meeting, Grenoble

27.09 RO-LCG 2013 Workshop - “Grid, Cloud and High Performance Computing in Science”, 'Mircea cel Batran' Naval Academy, Constanta (IEEE Proceedings)

14-18.10 Training for Grid operators and administrators, IFIN-HH.

CAF Meeting

- RO-07-NIPNE STATUS, M. Ciubancan
- INCDTIM Grid Site RO-14-ITIM, F. Farcas
- Status of RO-16-UAIC Grid site in 2013, C. Panzaru

RO-LCG 2013

- Data transfer optimisation for the DFCTI grid site, M. Ciubancan, S. Constantinescu, M. Dulea
- Monitoring LHCONE traffic in Romanian grid sites, C. Pinzaru, V. Vraciu, P. Gasner, M. Subredu, O. Rusu
- RO-14-ITIM upgrade to EMI3, F. Farcas, D. Nicoara, J. Nagy
- ISS Monitoring Technologies and Status Report on Grid Activities, I. Stan, A. Sevcenco, S. Zgura, L. Irimia

7 - MANAGEMENT & COORDINATION

RO-LCG Report:
- Initiative started in January 2013.
- Discussed at the RO-LCG Meeting, 12.02.2013.
- End of survey in February 2013.
- Final version presented to the recently designated Grid Committee (Ian Bird, WLCG Project Leader; Markus Schulz, WLCG Deployment Manager, IT Department; Jamie Shiers, WLCG Operations Manager, IT Department; Oliver Keeble, Leader of the EMI Data Management Project, IT Department; Frederic Hemmer, IT Department Head).

Following ISAB’s recommendation, the Romanian WLCG Steering Committee (RWSC) was formed. The members of the RWSC are:
- WLCG project management: Mihnea Dulea
- ALICE coordinator/deputy: Mihai Petrovici; Claudiu Schiaua
- ATLAS coordinator/deputy: Calin Alexa; Gabriel Stoicea
- LHCb coordinator/deputy: Florin Maciuc; Alexandru Grecu

First meeting of the RWSC: CERN, 16.04.2013; chaired by Dir. Gen. Florin Buzatu (IFA).

The members of the RWSC attend the monthly RO-LCG meetings.

8 - ADDITIONAL ACTIVITIES

Collaboration with ATLAS France

Improvement of ATLAS analysis on RO-LCG sites. Participation of a RO specialist in ATLAS Distributed Computing Operations Shift & support team of FR Cloud (Squad). Training supported from 8EU.

Collaboration with LIT/JINR

Hulubei-Meshcheriakov programme, project “Optimization Investigations of the Grid and Parallel Computing Facilities at LIT-JINR and Magurele Campus”

Outreach activities

Continuing a program for high schools that started in 2012.

High school students from the 'Tudor Vianu' National College of Informatics visited the HPC Center @ IFIN-HH in February 2013 and attended a presentation on Computational Physics, including an introduction to the LHC and Grid computing.

9 - PROSPECTS 2014, 2015

Network support

- T2D qualification of ATLAS sites (2014)
- 40 Gbps upgrade of the external link, in collaboration with RoEduNet (2014)

Grid site optimization

Further optimization of the local computing configurations for ATLAS and LHCb data reprocessing, to fulfill the requirements of a larger number of analysis jobs (2014)

Dissemination

- RO-LCG conference, IFIN-HH (2014)
- RO-LCG workshop, ITIM, Cluj (2015)

New technologies and technological updates

- Service for monitoring and control of resources in the framework of LCG jobs: correlation of information between jobs and resources; scheduling algorithms (2014).
- Virtualization of the Grid clusters: information correlation between jobs and virtual worker nodes; live-migration algorithms (2014).
- Testing middleware performance for HPC clusters in RO-LCG (2014).
- Open cloud interfaces (2015).

THANK YOU FOR YOUR ATTENTION !
