ALICE T1/T2 Workshop, 4-6 June 2013, CCIN2P3 Lyon: Famous Last Words
TRANSCRIPT
ALICE T1/T2 workshop
4-6 June 2013
CCIN2P3 Lyon
Famous last words
ALICE T1/T2 workshops
• Yearly event
• 2011 – CERN
• 2012 – KIT (Germany)
• 2013 – CCIN2P3 (Lyon)
• Aims at gathering middleware and storage software developers, grid operation and network experts, and site administrators involved in ALICE computing activities
Some stats of the Lyon workshop
• 46 registered participants, 45 attended
• Good attendance, clearly these events are still popular and needed
• 24 presentations over 5 sessions
• 9 general on operations, software, procedures
• 15 site-specific
• Appropriate number of coffee and lunch breaks, social events
• Ample time for questions (numerous) and discussion (lively), true workshop style
Themes
• Operations summary
• WLCG middleware/services
• Monitoring
• Networking: LHCONE and IPv6
• Storage: xrootd v4 and EOS
• CVMFS and AliRoot
• Site operations, upgrades and (new) projects, gripes (actually none…)
Messages digest from the presentations
• The original slides are available at the workshop Indico page
• Operations
• Successful year for ALICE and Grid operations – smooth and generally problem free, incident handling is mature and fast
• No changes foreseen to the operations principles and communication channels
• 2013/2014 (LHC LS1) will be years of data reprocessing and infrastructure upgrade
• The focus is on analysis – how to make it more efficient
Messages (2)
• WLCG middleware
• CVMFS installed on many sites, leverage ALICE deployment and tuning through the existing TF
• WLCG VO-box is there and everyone should update
• All EMI-3 products can be used
• SHA-2 is on the horizon, services must be made compatible
• glExec – hey, it is still alive!
• Agile Infrastructure – IaaS, SaaS (for now)
• OpenStack (Cinder, Keystone, Nova, Horizon, Glance)
• Management through Puppet (Foreman, MPM, PuppetDB, Hiera, git) … and Facter
• Storage with Ceph
• All of the above – prototyping and tests, ramping up
Messages (3)
• Site dashboard
• http://alimonitor.cern.ch/siteinfo/issues.jsp
• Get on the above link and start fixing, if you are on the list
• LHCONE
• The figure speaks for itself
• All T2s should get involved
• Instructions, expert lists are in the presentation
Messages (4)
• IPv6 and ALICE
• IPv4 address space almost depleted, IPv6 is being deployed (CERN, 3 ALICE sites already)
• Not all services are IPv6-ready – testing and adjustment are needed (see the sketch after this list)
• Cool history of the network bandwidth evolution
• Xrootd 4.0.0
• Complete client rewrite, new caching, non-blocking requests (client call-back), new user classes for metadata and data operations, IPv6 ready (see the sketch after this list)
• Impressive speedup for large operations
• API redesigned, no backward compatibility, some CLI commands change names
• ROOT plugin ready and being tested
• Mid-July release target
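
A minimal sketch of what making a service "IPv6-ready" typically involves on the client side, as mentioned in the IPv6 bullets above: resolve the host with AF_UNSPEC so both AAAA and A records are tried, instead of hard-coding IPv4. The host name and port are placeholders; only standard POSIX calls are used.

```cpp
// Try every address returned for the host (IPv6 and IPv4) until one connects.
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

static int connect_any(const char *host, const char *port)
{
  addrinfo hints;
  std::memset(&hints, 0, sizeof(hints));
  hints.ai_family   = AF_UNSPEC;     // accept IPv4 or IPv6, whichever resolves
  hints.ai_socktype = SOCK_STREAM;

  addrinfo *res = 0;
  if (getaddrinfo(host, port, &hints, &res) != 0)
    return -1;

  int fd = -1;
  for (addrinfo *rp = res; rp != 0; rp = rp->ai_next) {
    fd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
    if (fd < 0)
      continue;
    if (connect(fd, rp->ai_addr, rp->ai_addrlen) == 0)
      break;                         // connected over whichever family worked
    close(fd);
    fd = -1;
  }
  freeaddrinfo(res);
  return fd;                         // -1 if neither family worked
}

int main()
{
  int fd = connect_any("alimonitor.cern.ch", "80");  // placeholder target
  std::printf("%s\n", fd >= 0 ? "connected" : "no connection over IPv4 or IPv6");
  if (fd >= 0)
    close(fd);
  return 0;
}
```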
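
The non-blocking, call-back driven style of the rewritten client mentioned in the Xrootd 4.0.0 bullets could look roughly like the sketch below. Class and method names follow the XrdCl interface (XrdCl::File, XrdCl::ResponseHandler); the URL is a placeholder and the exact signatures should be checked against the release headers rather than taken from this sketch.

```cpp
// Minimal sketch of an asynchronous open with the XRootD 4 client (XrdCl).
#include <XrdCl/XrdClFile.hh>
#include <future>
#include <iostream>

// Call-back object: the client invokes HandleResponse() when the server
// answers, so the calling thread never blocks on the request itself.
class OpenHandler : public XrdCl::ResponseHandler
{
  public:
    std::promise<bool> done;

    virtual void HandleResponse( XrdCl::XRootDStatus *status,
                                 XrdCl::AnyObject    *response )
    {
      done.set_value( status->IsOK() );
      delete status;
      delete response;
    }
};

int main()
{
  XrdCl::File file;
  OpenHandler handler;
  std::future<bool> result = handler.done.get_future();

  // Non-blocking open: the call returns immediately, the handler fires later.
  file.Open( "root://xrootd.example.org//alice/data/sample.root",
             XrdCl::OpenFlags::Read, XrdCl::Access::None, &handler );

  // Here we simply wait for the call-back; a real client would do other work
  // (or queue further requests) in the meantime.
  bool ok = result.get();
  std::cout << ( ok ? "open succeeded" : "open failed" ) << std::endl;
  if( ok )
    file.Close();
  return 0;
}
```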
Messages (5)
• EOS
• Main disk storage manager at CERN, 45 PB deployed, 32 PB used (9.9/8/3 ALICE)
• Designed to work with cheap storage servers, uses software RAID (RAIN), ppm probability of file loss (see the estimate after this list)
• Impressive array of control and service tools (operations in mind)
• Even more impressive benchmarks…
• Site installation – read the pros/cons carefully to decide if it is good for you
• Support – best effort, xrootd type
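
The ppm-level loss probability quoted above can be illustrated with the standard redundancy estimate; the stripe counts and per-stripe failure probability below are hypothetical illustration values, not EOS's actual layout. For a file stored as $n$ stripes of which any $k$ suffice to reconstruct it, the file is lost only if more than $n-k$ stripes fail before repair, so with independent per-stripe failure probability $p$:

$$ P_{\text{loss}} \approx \binom{n}{\,n-k+1\,}\, p^{\,n-k+1}, \qquad \text{e.g. } n=6,\ k=4,\ p=5\times10^{-3} \;\Rightarrow\; \binom{6}{3}\,(5\times10^{-3})^{3} \approx 2.5\times10^{-6}, $$

i.e. a few parts per million for these assumed values.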
Messages (6)
• ALICE production and analysis software
• AliRoot is “one software to rule them all” in ALICE offline
• >150 developers; analysis: 1M SLOC; reconstruction, simulation, calibration, alignment, visualization: ~1.4M SLOC; supported on many platforms and flavors
• In development since 1(8)998
• Sophisticated MC framework with embedded physics generators, using G3 and G4
• Incorporates the full calibration code, which is also run on-line and in HLT (code share)
• Fully encapsulates the analysis; a lot of work on improving it, more quality and control checks needed
• Efforts to reduce memory consumption in reco
• G4 and Fluka in MC
Messages (7)
• CVMFS – timeline and procedures
• Mature, scalable and supported product
• Used by all other LHC experiments (and beyond)
• Based on the proven CernVM family
• Enabling technology for Clouds, CernVM as a user interface, Virtual Analysis Facilities, opportunistic resources, volunteer computing, part of Long Term Data Preservation
• April 2014 – CVMFS on all sites, only method of software distribution for ALICE
Sites Messages (1)
• UK
• GridPP T1+19, RAL, Oxford and Birmingham for ALICE
• Smooth operation, ALICE can (and does) run beyond its pledge, occasional problems with job memory
• Test of cloud on a small scale
• RMKI_KFKI
• Shared CMS/ALICE (170 cores, 72 TB disk)
• Good resource delivery
• Fast turnaround of experts; good documentation on operations is a must (done)
Sites Messages (2)
• KISTI
• Extended support team of 8 people
• Tape system tested with RAW data from CERN
• Network still to be debugged, but not a showstopper
• CPU to be ramped up x2 in 2013
• Well on its way to becoming the first T1 since the big T1 bang
• NDGF
• Lose some (PDC), get some more cores (CSC)
• Smooth going; dCache will stay and will get location information to improve efficiency
• The 0.0009 efficiency (reported, not real) at DCSC/KU is still a mystery; it hurts NDGF as a whole and must be fixed
Sites Messages (3)
• Italy
• New head honcho – Domenico Elia (thank you, Massimo!)
• Funding is tough; National Research Projects help a lot with manpower, PON helps with hardware in the south
• 6 T2s and a T1 – smooth delivery and generally no issues
• Torino is a hotbed of new technology – Clouds (OpenNebula, GlusterFS, OpenWRT)
• TAF is open for business, completely virtual (surprise!)
• Prague
• The city is (partially) under water
• Currently 3.7k cores, 2 PB disk, shared LHC/D0, contributes ~1.5% of the ALICE+ATLAS Grid resources
• Stable operation, distributed storage
• Funding situation is degrading
Sites Messages (4)
• US
• LLNL+LBL resource purchasing is complementary and fits well to cover changing requirements
• CPU pledges fulfilled, SE a bit underused, on the rise
• Infestation of the ‘zombie grass’ jobs; this is California, something of this sort was to be expected…
• Possibility for tape storage at LBL (potential T1)
• France
• 8 T2s, 1 T1, providing 10% of WLCG power, steady operation
• Emphasis on common solutions for services and support
• All centres are in LHCONE (7 PB in + 7 PB out have already passed through it)
• Flat resource provisioning for the next 4 years
Sites Messages (5)
• India (Kolkata)
• Provides about 1.2% of ALICE resources
• Innovative cooling solution, all issues of the past solved, stable operation
• Plans for steady resource expansion
• Germany
• 2 T2s, 1 T1 – the largest T1 in WLCG, provides ~50% of ALICE T1 resources
• Good centre name: Hessisches Hochleistungsrechenzentrum Goethe Universität (requires an IQ of 180 to say it)
• The T2s have heterogeneous installation (both batch and storage), support many non-LHC groups, well integrated in the ALICE Grid, smooth delivery
Sites Messages (6)
• Slovakia
• In ALICE since 2006
• Serves ALICE/ATLAS/HONE
• Upgrades planned for air-conditioning and power, later CPU and disk; expert support is a concern
• Reliable and steady resource provision
• RDIG
• RRC-KI (toward T1): hardware (CPU/storage) rollout, service installation and validation, personnel in place, pilot testing with ATLAS payloads
• 8 T2s + JRAF + PoD@SPbSU deliver ~5% of the ALICE Grid resources, historically support all LHC VOs
• Plans for steady growth and site consolidation
• Like all the others, reliable and smooth operation
Social events
Victory!
How are you so cool under pressure?
I work at a T1!
The group