TRANSCRIPT
22.10.2013, IFA 1
National contribution to the development of
the LCG computing grid for elementary
particle physics
- CONDEGRID -
Department of Computational Physics and Information Technologies (DFCTI), 'Horia Hulubei' National Institute for R&D in Physics
and Nuclear Engineering (IFIN-HH), Magurele, Romania
M. Dulea
ISAB Meeting, IFA, Magurele, 22.10.2013
OVERVIEW
1. EXPERIMENTAL CONTEXT
2. ACHIEVEMENTS: GRID RESOURCES
3. ACHIEVEMENTS: SERVICE RELIABILITY and GRID PRODUCTION
4. ACHIEVEMENTS: SITE OPTIMIZATION
5. ACHIEVEMENTS: ADOPTING NEW TECHNOLOGIES and INITIATIVES
6. ACHIEVEMENTS: TRAINING and DISSEMINATION
7. ACHIEVEMENTS: MANAGEMENT and COORDINATION
8. ADDITIONAL ACTIVITIES
9. PROSPECTS for 2014 and 2015
1 – EXPERIMENTAL CONTEXT
ALICE - Has recently improved analysis job efficiency; plans a transition from BitTorrent to CVMFS for the distribution of ALICE software on the WNs. - The computing consequences of the planned upgrade will be presented in a Computing TDR in Oct. 2014
Preparing for Run 2: Update of the Computing Models of the WLCG and the LHC Experiments – draft version, Sept. 2013. Goal: to minimize operational costs. New: use of opportunistic resources; computing as a service; etc.
ATLAS - The Run-2 Computing Model improves CPU consumption per event and AOD event sizes for data reconstruction, the MC production chain, and analysis - ATLAS resource management is modified so as to keep resource requirements within reasonable limits - Preparation of the ATLAS Data Challenge (June 2014)
LHCb - All data reprocessed in spring 2013 - Reduction by a factor of 2 in event size, achieved by improving the compression/output format - Growing need for disk space; 50 TB/week increase in disk usage; software efforts to reduce disk consumption - Decision to allow disk storage and analysis at some selected Tier2 sites
1 - OVERALL RESOURCES PLEDGES and REQUIREMENTS
http://gstat-wlcg.cern.ch/apps/pledges/summary/, 05.10.2013
Tier2 pledges & needs =>
ALICE “has historically faced a lack of resources” Following CRSG’s recommendations, ALICE lowered the required levels, but this induced an even lower offer (T1s)
ATLAS “is using its pledged (and non-pledged) resources in full and consequently believes there is a solid case for an increase in the CRSG resource recommendations (04) for the 2014 RRB year” (ATLAS Comp. Res. Usage March 2013 - September 2013)
Tier-2s “provided more resources than pledged. All resources were fully used, mainly for MC simulation, user analysis (and prerequisite group production), and MC reconstruction”
Concerns: “There remains a potential danger that LHC physics output could be limited by having insufficient computing resources” (LHCC deliberations, 04.2013)
2 - RESOURCE PROVISION
First task assumed by RO-LCG through the MoU: to provide the disk storage resources and computing power for the Monte Carlo simulations and data analysis needed by three LHC experiments: ALICE, ATLAS, LHCb
RO-LCG provides Grid resources and services through 7 computing centres:
Upgrade of resources during Y2 of 8EU:
Computing: ~ 4800 cores => ~ 6000 cores
Storage: ~ 1.8 PetaBytes => ~ 2.2 PetaBytes
1. Resource centres funded from 8EU
RO-07-NIPNE (alice, atlas, lhcb)
RO-13-ISS (alice)
RO-14-ITIM (atlas)
RO-16-UAIC (atlas)
2. Resource centres (of the experimental groups @ IFIN-HH) that are not funded from 8EU
NIHAM (alice)
RO-02-NIPNE (atlas)
RO-11-NIPNE (lhcb)
8EU supports coordination, monitoring, and activity reporting for all RO-LCG grid sites (including experiments' sites).
2 - FULFILLMENT of the RESOURCE PLEDGE
Second MoU task: to fulfill the annual resource pledge
The resource pledge for 2013-14 was established in 2012, in agreement with the experimental groups and with the CRSG requests
The fulfillment of the 2013 pledge was based on the 2012 upgrades
http://gstat-wlcg.cern.ch/apps/topology/federation/252/, 05.10.2013
R: The pledge target is overachieved
3 - SERVICE RELIABILITY and GRID PRODUCTION
RO-LCG GRID PRODUCTION
Jan.-Oct. 2012: 85.3 million HS06-hours run.
Jan.-Sep. 2013: 110 million HS06-hours run.
The grid production of the sites supported by 8EU represents 60% of the total RO-LCG production
Third MoU task: to fulfill the service level agreement (SLA) stipulated in the MoU
Average reliability coefficient of the main site supported by 8EU, RO-07-NIPNE: 95% (reliability coefficient of RO-07-NIPNE shown in figure)
Overachievement by other 8EU sites: RO-13-ISS: 100%, RO-16-UAIC: 98%
Figure: grid production (in kSI2k-hrs.) run on RO-07-NIPNE, on the other sites supported by the 8EU project, and on the RO-LCG sites that are not supported by the 8EU project
3 - GRID PRODUCTION in Jan.-Sep. 2013
R: RO-LCG ranks 11th among the 34 Tier-2 national centers in cumulated ALICE+ATLAS+LHCb CPU hours, with a share of 2.18% of the total
R: RO-LCG ranks 8th among the 34 Tier-2 national centers in cumulated ALICE+ATLAS+LHCb number of jobs, with a share of 2.76% of the total
Total CPU hours (kSI2k) and number of jobs recorded for RO-LCG during Sep.-Aug. intervals (see chart). The 2012-2013 values reflect the increased share of ATLAS analysis jobs and of the (longer) ALICE jobs (source: EGI accounting portal)
Y2 activity for RO-07-NIPNE (Jan.-Sept. 2013):
alice: production jobs, using CE1 (tbit01)
> 660000 jobs, ~ 2.47 MCPU hours
atlas: production & analysis jobs, using CE2 (tbit03)
~ 2000000 jobs, ~ 6.24 MCPU hours
lhcb: reprocessing jobs, using CE2
> 100000 jobs, ~ 1.45 MCPU hours
The improvement of ATLAS analysis capacity @ RO-07 strongly depends on the reliability of the external connection to the French ATLAS Cloud, and on the efficiency of data handling within the site.
[Chart: yearly totals of jobs and CPU hours for IFIN GRID and RO-07; vertical scale 0 – 35,000,000]
4 - OPTIMIZATION OF RO-07-NIPNE
Left: topology of RO-07-NIPNE. Right: after optimization, an overall traffic of 11 Gb/s from the storage server to the WNs was reached during the concurrent running of 500 analysis jobs
The optimum writing speed on a WN's disk can only be reached with the best configuration on both the disk server and the WN (disk write throughput). Example: improving the performance of a storage facility with a maximum transfer speed of 8 Gbps.
To ensure the scalability of the cluster, the bandwidth available for data transfer must be increased at a constant rate whenever the storage capacity is upgraded
SE-WN bandwidth upgrades through switch cascading (stack configurations)
4 - OPTIMIZATION OF RO-07-NIPNE
CD-CC bandwidth upgrades (currently 2x10 Gbps)
SE design raises special issues due to: a) 1 Gbps links; b) distribution of the hardware across 2 separate locations (CD/CC); c) a required minimum DD-WN transfer speed of 2.5 MB/s; d) repeated upgrades of the distributed and heterogeneous storage. A balance must be maintained between the storage capacity and the network throughput of each data disk server.
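The sizing rule above can be checked with a back-of-the-envelope calculation (an illustrative sketch, not part of the site tooling): the aggregate SE-WN bandwidth must cover the number of concurrent analysis jobs times the minimum per-job transfer speed quoted on this slide.

```python
# Illustrative check of SE-WN bandwidth sizing, using the figures from the
# slides: 500 concurrent analysis jobs and a required minimum DD-WN transfer
# speed of 2.5 MB/s per job.

def required_bandwidth_gbps(concurrent_jobs: int, min_mb_per_s: float) -> float:
    """Aggregate SE-WN bandwidth needed, in Gb/s (1 MB/s = 8 Mb/s)."""
    return concurrent_jobs * min_mb_per_s * 8 / 1000

if __name__ == "__main__":
    need = required_bandwidth_gbps(500, 2.5)
    print(f"required: {need:.0f} Gb/s")  # 500 x 2.5 MB/s = 1250 MB/s = 10 Gb/s
```

This is consistent with the measured 11 Gb/s aggregate traffic reached during the 500-job test after optimization.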
Reaching the bandwidth saturation of the external link @ Magurele
4 - OPTIMIZATION OF RO-07-NIPNE
The large data-transfer peaks recorded since 2012, mainly due to the grid sites, demonstrated the need for a bandwidth larger than 10 Gbps; buying a BGP router was therefore justified. An upgrade to 40 Gbps is planned.
The transfer capacity of the FO link Magurele - RoEduNet NOC has reached its limits
5 - ADOPTING NEW TECHNOLOGIES AND INITIATIVES
LHCONE (LHC Open Network Environment): a network private to the LHC Tier-1/2/3 sites; it provides better access to the datasets for the whole HEP community.
R: Integration of the LCG sites from Magurele into LHCONE (July 2013). Grid traffic is now separated from the common data traffic; a separate VLAN was defined on the existing network. Beneficiaries: first the ATLAS sites RO-02 and RO-07; next NIHAM, RO-11, RO-13-ISS. At present, all RO-LCG sites either benefit from it (ATLAS) or will benefit when the experiments decide (ALICE, LHCb).
Direct T2 (T2D) sites: reliable primary hosts for ATLAS datasets (analysis) and group analysis; they get and send data from different clouds and take part in cross-cloud production
Qualification test: transfer rates with at least 9 T1s must exceed 5 MB/s over given periods of time (see Fig.). To find weak points, the transfer capacity of the link Magurele - RoEduNet NOC - GEANT was tested with 3 pairs of perfSONAR servers (for bandwidth and latency), located at DFCTI, UPB, and in RoEduNet's NOC.
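The qualification criterion can be illustrated with a small sketch (the rates below are hypothetical placeholders; the real evaluation uses perfSONAR measurements against each Tier-1):

```python
# Sketch of the T2D qualification rule from the slide: a Tier-2 qualifies if
# its transfer rate exceeds 5 MB/s with at least 9 Tier-1 sites.
# The measured rates used here are illustrative, not real data.

THRESHOLD_MB_S = 5.0
MIN_T1_COUNT = 9

def qualifies_as_t2d(rates_mb_s: dict) -> bool:
    """True if at least MIN_T1_COUNT Tier-1s see rates above the threshold."""
    good = [t1 for t1, rate in rates_mb_s.items() if rate > THRESHOLD_MB_S]
    return len(good) >= MIN_T1_COUNT

if __name__ == "__main__":
    measured = {f"T1-{i}": 6.0 for i in range(10)}  # 10 Tier-1s above 5 MB/s
    measured["T1-slow"] = 2.0                        # one below threshold
    print(qualifies_as_t2d(measured))  # True
```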
R: RO-07-NIPNE started tests for T2D qualification. The poor performance recorded was an argument for replacing the connection provider.
Preparing for the deployment of IPv6
Two IPv6 prefixes (/48, /64) were acquired; tests were run between the IFIN GRID servers and the RO-16-UAIC servers. Collaboration with RoEduNet.
6 - TRAINING & DISSEMINATION
11.04 Calcul Atlas France (CAF) Meeting, Grenoble
27.09 RO-LCG 2013 Workshop - “Grid, Cloud and High Performance Computing in Science”, 'Mircea cel Batran' Naval Academy, Constanta (IEEE Proceedings)
14-18.10 Training for Grid operators and administrators, IFIN-HH.
CAF Meeting
RO-07-NIPNE STATUS, M. Ciubancan
INCDTIM Grid Site RO-14-ITIM, F. Farcas
Status of RO-16-UAIC Grid site in 2013, C. Panzaru
RO-LCG 2013
Data transfer optimisation for the DFCTI grid site, M. Ciubancan, S. Constantinescu, M. Dulea
Monitoring LHCONE traffic in Romanian grid sites, C. Pinzaru, V. Vraciu, P. Gasner, M. Subredu, O. Rusu
RO-14-ITIM upgrade to EMI3, F. Farcas, D. Nicoara, J. Nagy
ISS Monitoring Technologies and Status Report on Grid Activities, I. Stan, A. Sevcenco, S. Zgura, L. Irimia
7 - MANAGEMENT & COORDINATION
RO-LCG Report: initiative started in January 2013; discussed at the RO-LCG Meeting of 12.02.2013; survey completed in February 2013. The final version was presented to the recently designated Grid Committee (Ian Bird - WLCG Project Leader; Markus Schulz - WLCG Deployment Manager, IT Department; Jamie Shiers - WLCG Operations Manager, IT Department; Oliver Keeble - Leader of the EMI Data Management Project, IT Department; Frederic Hemmer - IT Department Head)
Following ISAB's recommendation, the Romanian WLCG Steering Committee (RWSC) was formed. The members of the RWSC are:
WLCG project management: Mihnea Dulea
ALICE coordinator/deputy: Mihai Petrovici; Claudiu Schiaua
ATLAS coordinator/deputy: Calin Alexa; Gabriel Stoicea
LHCb coordinator/deputy: Florin Maciuc; Alexandru Grecu
First meeting of the RWSC: CERN, 16.04.2013; chaired by Dir. Gen. Florin Buzatu (IFA) The members of the RWSC attend the monthly RO-LCG meetings.
8 - ADDITIONAL ACTIVITIES
Collaboration with ATLAS France
Improvement of ATLAS analysis on RO-LCG sites. Participation of a RO specialist in the ATLAS Distributed Computing Operations Shift & support team of the FR Cloud (Squad). Training supported by 8EU.
Collaboration with LIT/JINR
Hulubei-Meshcheriakov programme, project “Optimization Investigations of the Grid and Parallel Computing Facilities at LIT-JINR and Magurele Campus”
Outreach activities
Continuation of a program for high schools started in 2012.
High school students from the 'Tudor Vianu' National College of Informatics visited the HPC Center @ IFIN-HH in February 2013 and attended a presentation on computational physics, including an introduction to the LHC and Grid computing.
9 - PROSPECTS 2014, 2015
Network support
T2D qualification of ATLAS sites (2014)
40 Gbps upgrade of the external link, in collaboration with RoEduNet (2014)
Grid site optimization
Further optimization of the local computing configurations for ATLAS and LHCb data reprocessing, to fulfill the requirements of a larger number of analysis jobs (2014)
Dissemination
RO-LCG conference, IFIN-HH (2014)
RO-LCG workshop, ITIM, Cluj (2015)
New technologies and technological updates
Service for monitoring and control of resources in the framework of LCG jobs: correlation of information between jobs and resources; scheduling algorithms (2014)
Virtualization of the Grid clusters: correlation of information between jobs and virtual worker nodes; live-migration algorithms (2014)
Testing middleware performance for HPC clusters in RO-LCG (2014)
Open cloud interfaces (2015)
THANK YOU FOR YOUR ATTENTION !