WLCG Tier-2 site in Prague: a little bit of history, current status and future perspectives
Dagmar Adamova, Jiri Chudoba, Marek Elias, Lukas Fiala, Tomas Kouba, Milos Lokajicek, Jan Svec
Prague, 04.09.2014
Outline
Introducing the WLCG Tier-2 site in Prague
A couple of history flashbacks – we celebrate the 10th anniversary
Current issues
Summary and Outlook
HEP Computing in Prague: site praguelcg2 (a.k.a. the farm GOLIAS)
• A national computing center for processing data from various HEP experiments
– Located in the Institute of Physics (FZU) in Prague
– Basic infrastructure already in 2002, but
– OFFICIALLY STARTED IN 2004 – 10th ANNIVERSARY THIS YEAR
• Certified as a Tier-2 center of the LHC Computing Grid (praguelcg2)
– Collaboration with several Grid projects
• April 2008, WLCG MoU signed by the Czech Republic (ALICE+ATLAS).
• Excellent network connectivity: Multiple dedicated 1 – 10 Gb/s connections to collaborating institutions. Connected to LHCONE.
• Provides computing services for ATLAS + ALICE, D0, Solid state physics, Auger, Star ...
• Started in 2002 with:
– 32 dual PIII 1.2 GHz rack servers, 1 GB RAM, 18 GB SCSI HDD, 100 Mb/s Ethernet ... (29 of these decommissioned in 2009)
– Storage: 1 TB disk array on an HP TC4100 server
History: 2002 -> 2014
Current numbers
• 1 batch system (Torque + Maui)
• 2 main WLCG VOs: ALICE, ATLAS
– FNAL's D0 (dzero) user group
– Other VOs: Auger, Star
• ~ 4000 cores published in the Grid
• ~ 3 PB on new disk servers (DPM, XRootD, NFS)
• Regular yearly upscale of resources based on various sources of financial support, mainly academic grants
• The WLCG services include:
– APEL publisher, Argus authorization service, BDII, several UIs, ALICE VOBOX, CREAM CEs, Storage Elements
• The use of virtualization at the site is quite extensive
• ALICE disk XRootD Storage Element ALICE::Prague::SE
– ~ 1.113 PB of disk space in total
– Redirector/client + 3 clients @ FZU, 5 clients @ NPI Rez: a distributed storage cluster
Site Usage
ATLAS and ALICE – continuous production; other projects – shorter campaigns
ALICE
ATLAS
Some history flashbacks (celebrating the 10th anniversary)
ALICE PDC 2004 resource statistics: 14 sites
ALICE 2014 resource statistics: 74 sites
ALICE PDC resource statistics – 2005
25 sites in operation
Running jobs (8 November 2005)
Farm Min Avg Max
Sum 1160 1651 1771
CCIN2P3 134 210 231
CERN-L 268 286 304
CNAF 255 362 394
FZK 0 531 600
Houston 0 3 14
Münster 2 58 81
Prague 43 61 71
Sejong 2 2 2
Torino 33 41 43
2006
• ALICE vobox set-up
– fixing problems with the vobox proxy (unwanted expirations)
– AliEn services set up
– manually changing the RBs used by the JAs
• successful participation in the ALICE PDC'06:
– Prague site delivered ~ 5% of total computing resources (6 Tier-1s, 30 Tier-2s)
• Problems with the fair-share of the site's local batch system (then PBSPro)
2007
• still problems with the functioning of the ALICE vobox proxy
• during the PDC'07, problems with job submission due to malfunctions of the default RBs; the failover submission was configured
• Prague site delivered ~ 2.6% of total computing resources (a significant increase in the number of Tier-2s)
2008
• migration to gLite 3.1; ALICE vobox on a 64-bit SLC4 machine
• upgrade of the local CE serving ALICE to lcg-CE 3
• repeating problems with job submission through RBs; in Oct. the site was re-configured for WMS submission
2009
• migration to the Torque batch system on a part of the site: some WNs on 32-bit in PBS and some on 64-bit in Torque
• installation and tuning of the creamCE
• hybrid state:
– 'glite' vobox and WNs, 32-bit
– 'cream' vobox submitting JAs directly to the creamCE, Torque, 64-bit
– Dec: ALICE jobs submitted only to the creamCE
2010
• creamCE 1.6 / gLite 3.2 / SL5 64-bit installed in Prague; we were the first ALICE Tier-2 where cream 1.6 was tested and put into production
• NGI_CZ set in operation
2011
• Start of IPv6 implementation
• The site router got an IPv6 address
• Routing set up in special VLANs
• ACLs directly implemented in the router
• IPv6 address configuration: DHCPv6
• Set-up of an IPv6 testbed
2012
• Optimization of the ALICE XRootD storage cluster performance
• an extensive tuning of the cluster, motivated by a remarkably different performance of the individual machines:
– data was migrated from the machine to be tuned to free disk arrays at another machine of the cluster
– the migration procedure was done so that the data was accessible all the time
– the empty machine was re-configured
– the number of disks in one array was reduced
– disk failure monitoring was set up
– the RAID controller cache was carefully configured
– the readahead option was set to a multiple of (stripe_unit * stripe_width) of the underlying RAID array
– no partition table was used, to ensure proper alignment of the file systems: they were created with the right geometry options (the '-d su=...,sw=...' mkfs.xfs switches), as sketched after this list
– mounting was performed with the noatime option
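A minimal sketch of how the alignment and readahead tuning described above could be scripted; the device name, RAID chunk size, disk count, readahead multiple and mount point are illustrative assumptions, not the values used on the Prague servers.

#!/usr/bin/env python3
"""Sketch: create an XFS file system aligned to the underlying RAID geometry
and set the block-device readahead to a multiple of the full stripe size.
All values below (device, chunk size, disk count, mount point) are illustrative only."""

import subprocess

DEVICE = "/dev/sdb"          # hypothetical RAID volume exported by the controller
CHUNK_KIB = 256              # RAID chunk size per disk = XFS stripe unit (su)
DATA_DISKS = 10              # data disks in the array = XFS stripe width (sw)
READAHEAD_STRIPES = 4        # readahead as a multiple of the full stripe

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Full stripe size in KiB, and readahead in 512-byte sectors for blockdev --setra.
stripe_kib = CHUNK_KIB * DATA_DISKS
readahead_sectors = READAHEAD_STRIPES * stripe_kib * 2

# No partition table: run mkfs.xfs directly on the device with explicit geometry.
run(["mkfs.xfs", "-f", "-d", f"su={CHUNK_KIB}k,sw={DATA_DISKS}", DEVICE])

# Readahead set to a multiple of (stripe_unit * stripe_width).
run(["blockdev", "--setra", str(readahead_sectors), DEVICE])

# Mount with the noatime option, as done on the tuned servers.
run(["mount", "-o", "noatime", DEVICE, "/data/xrootd"])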
Parameters of one of the optimized XRootD servers before and after tuning
2013
• Almost all machines migrated to SL6
• CVMFS installed on all machines
• Connected to LHCONE
praguelcg2 contribution to WLCG Tier-2 ATLAS+ALICE computing resources
• http://accounting.egi.eu/
• A long-term slide down due to problems with financial support
Current issues
Monitoring issues
• A number of monitoring tools in use: NAGIOS, MUNIN, GANGLIA, MRTG, NETFLOW, Gstat, MonALISA
• Nagios:
– IPv6-only or IPv4-only servers connected to the central dual-stack node via Livestatus (see the sketch after this list)
– Some checks can be run from IPv4-only or IPv6-only Nagios nodes
• MUNIN2:
– current version 2.0.19
– IPv6 in testing
• Ganglia:
– problems if the proper gai.conf is not present
– gmetad doesn't bind to an IPv6 address on aggregators
• NetFlow:
– plan to switch from v5 to v9 to use nfdump + nfsen
• Some new sensors are needed to fully deploy IPv6; some additional work necessary
• MonALISA REPOSITORY:
– A simple test version installed, plans for future development
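A minimal sketch of how a checker running on an IPv4-only or IPv6-only node could query host states from the central dual-stack Nagios node via the MK Livestatus TCP interface; the host name, port and query are illustrative assumptions, not the actual site configuration.

#!/usr/bin/env python3
"""Sketch: query host states from a central dual-stack Nagios node over
MK Livestatus. Host, port and query below are assumptions for illustration."""

import json
import socket

LIVESTATUS_HOST = "nagios-central.example.cz"   # hypothetical dual-stack Nagios node
LIVESTATUS_PORT = 6557                          # commonly used Livestatus TCP port

QUERY = (
    "GET hosts\n"
    "Columns: name state\n"
    "OutputFormat: json\n"
    "\n"
)

# socket.create_connection() uses getaddrinfo, so the same client works
# from an IPv4-only or an IPv6-only node.
with socket.create_connection((LIVESTATUS_HOST, LIVESTATUS_PORT)) as sock:
    sock.sendall(QUERY.encode("ascii"))
    sock.shutdown(socket.SHUT_WR)   # Livestatus answers once the request side is closed
    data = b"".join(iter(lambda: sock.recv(4096), b""))

for name, state in json.loads(data):
    print(f"{name}: {'UP' if state == 0 else 'DOWN/UNREACHABLE'}")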
Network monitoring – weathermap
LHCONE link is heavily utilized (capacity 10 Gbps)
Nagios for alerts
Network architecture at FZU
Outgoing IPv4 local traffic from DPM servers
Outgoing IPv6 local traffic from DPM servers
IPv6 deployment
• Currently on dual-stack: DPM headnode, all production disk nodes, all but 2 subclusters of WNs
• Over IPv6 goes: dpns between disk nodes and the headnode, SRM between WNs and the headnode, actual data transfer via GridFTP
• IPv6 enabled on the ALICE vobox
Site services management
• Since 2008, services management has been done with CFEngine version 2
– a cfagent Nagios sensor was developed: a Python script checking CFEngine logs for fresh records (it signals an error if the log is too old); a minimal sketch follows after this list
• CFEngine v2 used for production
• Puppet used for IPv6 testbed
• Migration to the overall Puppet management in progress
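A minimal sketch of what such a log-freshness Nagios check could look like; the log path, the freshness threshold and the plugin layout are illustrative assumptions, not the actual site script.

#!/usr/bin/env python3
"""Sketch of a Nagios-style check that verifies a CFEngine log contains fresh records.
The log path and the age threshold below are assumptions for illustration only."""

import os
import sys
import time

LOG_FILE = "/var/cfengine/promise_summary.log"   # hypothetical CFEngine log location
MAX_AGE_SECONDS = 2 * 3600                       # assumed threshold: 2 hours without a record

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main():
    try:
        age = time.time() - os.stat(LOG_FILE).st_mtime
    except OSError as exc:
        print(f"UNKNOWN: cannot stat {LOG_FILE}: {exc}")
        return UNKNOWN

    if age > MAX_AGE_SECONDS:
        print(f"CRITICAL: {LOG_FILE} last updated {int(age)} s ago")
        return CRITICAL

    print(f"OK: {LOG_FILE} updated {int(age)} s ago")
    return OK

if __name__ == "__main__":
    sys.exit(main())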
NGI_CZ
• Since 2010, NGI_CZ is recognized and in operation: https://wiki.metacentrum.cz/metawiki/NGI_CZ#Farma_golias_aka_praguelcg2
• all the events and relevant information about praguelcg2
• 2 sites involved: praguelcg2 and prague_cesnet_lcg2
• a significant part of the services provided by the praguelcg2 team
Services provided by NGI_CZ for the EGI infrastructure:
– Accounting (APEL, DGAS, CESGA portal)
– Resources database (GOC DB)
– Operations – https://operations-portal.egi.eu/
– ROD (Regional Operator on Duty)
– Top-level BDII
– VOMS servers
– Meta VO
– User support (GGUS/RT) – https://rt4.cesnet.cz/rt/
Middleware versions: UMD 3.0.0, EMI 3.0
Use of external resources
• Not much really to choose from
• Longer-term usage of the cluster 'skurut' in Prague: site prague_cesnet_lcg2, courtesy of the CESNET association
– a long-established cooperation
• NGI_CZ provided a single opportunity to use ~ 35 TB of disk storage in Pilsen
– mostly for testing purposes
– dCache manager used
– evaluating the effect of switching/tuning TTreeCache and dCap read-ahead (see the sketch after this list)
– not of much help as an extension of home resources
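A minimal PyROOT sketch of the kind of client-side TTreeCache tuning that was evaluated; the file URL, tree name and cache size are hypothetical placeholders, not the actual test configuration.

#!/usr/bin/env python3
"""Sketch: enable and size the ROOT TTreeCache when reading a remote file,
the kind of client-side tuning evaluated against the Pilsen dCache storage.
The file URL, tree name and cache size are illustrative assumptions."""

import ROOT

FILE_URL = "root://dcache.example.cz//pnfs/example.cz/data/sample.root"  # hypothetical URL
TREE_NAME = "esdTree"                                                    # hypothetical tree name
CACHE_SIZE = 100 * 1024 * 1024    # 100 MB TTreeCache; the value to be tuned

f = ROOT.TFile.Open(FILE_URL)
tree = f.Get(TREE_NAME)

# Turn on the TTreeCache and register all branches that will be read.
tree.SetCacheSize(CACHE_SIZE)
tree.AddBranchToCache("*", True)

for entry in range(tree.GetEntries()):
    tree.GetEntry(entry)

# Compare bytes read and read calls with/without the cache to judge the effect.
print("bytes read:", f.GetBytesRead(), "read calls:", f.GetReadCalls())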
Summary and Outlook
The Prague Tier-2 site has been performing as a distinguished member of the WLCG collaboration for 10 years now
A stable upscale of resources
High accessibility, reliable delivery of services, fast response to problems
Into the upcoming years, we will do our best to keep up the reliability and performance level of the services
Crucial is the high-capacity, state-of-the-art network infrastructure provided by CESNET
However, the future LHC runs will require a huge upscale of resources which will be impossible for us to achieve with the expected flat budget
Like everybody else these days, we are searching for external resources: we got some help from CESNET but need more. As widely recommended, we will very likely try to collaborate with non-HEP scientific projects to gain access to additional resources in the future
A couple of current plots
RUNNING ALICE JOBS IN PRAGUE in 2013/2014:
Average = 996, maximum = 2227
Total number of processed jobs: ~ 5 million
GRID for ALICE in Prague – Monitoring jobs (MonALISA)
ALICE Disk Storage Elements – 62 endpoints, ~ 34 PB
Prague scores with the largest Tier-2 storage
NETWORK TRAFFIC ON PRAGUE ALICE STORAGE CLUSTER in 2013/2014:
(Total disk space capacity 1.113 PB)
Max total traffic IN/write: 195 MB/s
Max total traffic OUT/read: 1.05 GB/s
Total data OUT/read : 5.322 PB
GRID for ALICE in Prague – Monitoring storage (MonALISA)