LHCbComputing
Lessons learnt from Run I
Growing pains of a lively teenager
The toddler years
- Gaudi: 1998
  - Object oriented, C++9x, STL, (CERNLIB)
    - NO vector instructions (pre-SSE2)
  - Architecture done from scratch
    - But needed backward compatibility from the start:
      - Existing simulation/reconstruction from the TP (SICB)
      - TDR studies had to be supported throughout
- DIRAC: 2002
  - Supporting major production and user analysis uninterrupted since DC04
- Computing TDR: 2005
  - Computing model designed for a 2 kHz HLT rate
Growing up
- An incremental approach to deployment of new features
  - Adapting to changing requirements and environment
    - E.g. one order of magnitude increase in HLT output rate
  - Learn from the past: throw away what works badly, keep and improve what works well
- Development in parallel with a running production system
  - Physics software in production since 1996
    - Detector design, detector optimisation, HLT + physics preparation, physics exploitation
  - Production system continuously supporting major productions and analysis since 2004
- Strong constraint also for the future
  - Continue to support the running experiment
  - Continue to support analysis of legacy data
    - Minimise the pain of maintenance by supporting legacy data in new software
  - Do not underestimate the training effort needed to keep users on board
Andrei Tsaregorodtsev – Sept 2004
Reconstruction
- Experience from Run 1:
  - Reco14 vs. Reco12 (reprocessing of 2011 data)
    - Significant differences in signal selection
  - Reco14 (first reconstruction of 2012 data):
    - If we can provide calibration "fast enough", we do not need reprocessing
- Run 2 strategy for optimal selection efficiency:
  - Online calibration
    - Use it both online and offline; reprocessing becomes unnecessary
  - Online reconstruction
    - Sufficiently fast to run identical code online and offline
- Given fast enough code and/or sufficient resources in the HLT farm, could skip "offline" reconstruction altogether
  - Opens up new possibilities
    - Reduced need for RAW data offline (c.f. TURBO stream)
    - Optimise reconstruction for HLT farm hardware
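The "identical code online and offline" idea can be illustrated with a minimal sketch: one reconstruction function, parameterised only by the calibration, called from both contexts. All names and numbers here are invented for illustration; this is not LHCb code.

```python
# Sketch: a single reconstruction code path shared by the HLT (online)
# and offline processing. Once the calibration is the same, the outputs
# agree by construction, and offline re-reconstruction becomes optional.
# Function and field names are hypothetical.

def reconstruct(raw_event, calibration):
    """One reconstruction implementation, reused online and offline."""
    tracks = [hit * calibration["alignment"] for hit in raw_event["hits"]]
    return {"tracks": tracks, "calib_tag": calibration["tag"]}

calib = {"tag": "2015-online-v1", "alignment": 1.02}
event = {"hits": [1.0, 2.0, 3.0]}

online = reconstruct(event, calib)    # run in the HLT farm
offline = reconstruct(event, calib)   # run (or skipped) offline

assert online == offline  # identical code + calibration -> identical result
```

The design point is that online/offline agreement stops being a validation problem and becomes a structural property of the software.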
Validation is key
- Long history of code improvements
  - But fixes mostly made as a result of "crises" in production
    - "St. Petersburg crisis" (July 2010)
      - Explosion of CPU time due to combinatorics at high (>2) mu
      - Was not foreseen, had never been properly tested with simulation
      - Led to the introduction of GECs (Global Event Cuts)
    - "Easter crisis" (April 2012)
      - Memory leaks in new code, insufficiently tested at a large enough scale
    - pA reconstruction (February 2013)
      - Events killed by the GECs introduced in 2010…
      - Big effort to understand the tails; bugs found and optimisations made so the cuts could be relaxed
- Better strategy in place for the 2015 startup
  - Turbo-validation stream
    - Allowed online-offline differences to be studied
    - Huge validation effort to understand (and fix) the differences
- But still fire-fighting
  - PbPb / PbAr reconstruction
    - Starting now… experience from 2013 should help
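A global event cut of the kind introduced after the 2010 crisis can be sketched as a cheap multiplicity gate run before the expensive combinatorial reconstruction. The threshold and names below are invented for illustration and are not LHCb's actual cuts.

```python
# Illustrative sketch of a Global Event Cut (GEC): reject events whose
# hit multiplicity is so high that combinatorial tracking would explode
# in CPU time. Threshold and field names are made up.

GEC_MAX_HITS = 10_000

def passes_gec(event):
    """Cheap O(1) gate, evaluated before the expensive reconstruction."""
    return event["n_hits"] <= GEC_MAX_HITS

def process(events):
    kept = [e for e in events if passes_gec(e)]
    # ...expensive combinatorial reconstruction runs only on `kept`...
    return kept

events = [{"n_hits": 500}, {"n_hits": 25_000}, {"n_hits": 9_999}]
print(len(process(events)))  # 2: the 25k-hit event is rejected
```

The 2013 pA episode above shows the flip side: a cut tuned for pp pileup can silently kill valid events in a different collision system, which is why such cuts need continuous validation rather than one-off tuning.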
Validation Challenges
- Clearly we do not do enough validation
  - We don't give ourselves enough time
  - The tools are difficult to use
  - It has little visibility
    - Fire-fighters are more sexy than building safety officers!
- We need to be better organised and to have formal goals
  - The online-offline differences work was a good example
    - Well recognised, accepted metric, with regular reporting
- We have a huge army of "validators": the physics analysts
  - Get them more involved in defining and testing functionality
  - Use current data-taking and analysis to test, commission and deploy new concepts for the future
- Where software has to be written from scratch, be more formal about software quality and validation
  - Needs a cultural change
Event selection (a.k.a. Trigger and Stripping)
- Event selection has moved closer to the detector
  - Computing TDR (2005):
    - 2 kHz HLT rate (of which 200 Hz b-exclusive)
    - 9.6 reduction factor in the Stripping
  - Run 1:
    - 5 kHz HLT rate
    - ~2.0 reduction factor in the Stripping (S21)
  - Run 2 (1866 colliding bunches):
    - 22.6 kHz HLT rate!! (18 kHz FULL, 4 kHz Turbo, 0.6 kHz Turcal)
    - <2.0 reduction in the Stripping
- Rethink the role of the "Stripping"
  - Most of the event rate reduction is now done in the HLT
  - Event size reduction
    - Removes RAW selectively
      - If not needed offline, why do we even write it out of the HLT?
    - Reduces DST size selectively (various MDST streams)
      - TURBO does something similar, already in the HLT…
  - Streaming reduces the number of events to handle in an analysis
    - Could we use an index instead?
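The rate evolution above can be cross-checked with a few lines of arithmetic; all numbers are the ones quoted on the slide, and a reduction factor f simply means 1/f of the events are kept.

```python
# Quick arithmetic on the quoted trigger/stripping rates.

run2_streams_khz = {"FULL": 18.0, "Turbo": 4.0, "Turcal": 0.6}
run2_hlt_khz = sum(run2_streams_khz.values())
print(run2_hlt_khz)  # 22.6, matching the quoted Run 2 HLT output rate

# HLT output rate grew by an order of magnitude vs. the Computing TDR:
tdr_hlt_khz = 2.0
print(run2_hlt_khz / tdr_hlt_khz)  # ~11.3x

# Stripping retention: a reduction factor f keeps 1/f of the events.
tdr_stripping_reduction = 9.6
run1_stripping_reduction = 2.0
print(1 / tdr_stripping_reduction)   # ~0.104 -> ~10% kept in the TDR model
print(1 / run1_stripping_reduction)  # 0.5    -> ~50% kept in Run 1
```

The combination makes the slide's point concrete: the HLT output rate went up by ~11x while the Stripping now removes less than half of what it receives, so almost all selectivity has migrated into the trigger.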
Analysis model
- One size does not fit all
  - DST, MDST, Turbo
  - Compromise between event rate and the ability to redo (parts of) the reconstruction
- What about N-tuples?
  - Completely outside the current computing model, yet they are what analysts mostly access
- Should access to stripping output be reduced, in favour of centralised N-tuple production?
  - Greater role for working-group productions
  - c.f. ALICE analysis trains
- One size does not fit all for simulation either
  - Fast simulation options
Data Popularity, disk copies
- 25% of disk space is occupied by files not accessed in the last year
- Can we be more clever?
  - Optimal number of copies?
  - More active use of tape?
  - Introduce working-group N-tuples into the model?
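One possible answer to "optimal number of copies?" is a popularity-driven replica policy: hot datasets get extra disk copies, cold ones fall back to a single disk copy and tape. The sketch below is a generic illustration, not LHCb's actual policy; all thresholds and names are invented.

```python
# Illustrative popularity-based replica policy (invented thresholds):
# recently and frequently accessed datasets get more disk replicas,
# datasets untouched for a year keep one disk copy and rely on tape.

def target_disk_replicas(days_since_last_access, accesses_last_90d):
    if days_since_last_access > 365:      # cold: not touched in a year
        return 1                          # single disk copy; tape custodial
    if accesses_last_90d > 100:           # hot: heavily used dataset
        return 3
    return 2                              # warm: default replication

print(target_disk_replicas(days_since_last_access=400, accesses_last_90d=0))   # 1
print(target_disk_replicas(days_since_last_access=10, accesses_last_90d=500))  # 3
print(target_disk_replicas(days_since_last_access=30, accesses_last_90d=5))    # 2
```

Applied to the figure above, such a policy would let the 25% of disk holding year-old files shrink toward a single replica instead of occupying full replicated space.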
Software preservation
- We have big difficulties keeping old software operational
  - e.g. Reco14 and Stripping21 cannot run as-is with Sim09
    - Due to ROOT5 / ROOT6 incompatibilities
  - e.g. "swimming" the Run 1 trigger on Reco14 data
    - Due to new classes on Reco14 DSTs, not known to Moore in 2012
- Similar problems with the CondDB
  - Increasingly complex dependencies between:
    - Software versions (new functionality, new calibration methods)
    - Real detector geometry changes
    - Different calibration or simulation versions
- Need much improved continuous validation of important workflows
  - To catch backward incompatibilities when they happen
  - To make the introduction of new concepts less error-prone
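Such continuous validation can be as simple as a scheduled job that re-runs frozen legacy workflows against the current software stack and compares a few summary numbers to stored references. The sketch below is generic; workflow names and reference values are invented, and `run_workflow` stands in for launching the real legacy application.

```python
# Minimal continuous-validation sketch: re-run legacy workflows and
# compare summary numbers against frozen references, so backward
# incompatibilities are caught when introduced, not years later.
# Workflow names and reference values are invented for illustration.

REFERENCES = {
    "legacy-reco-on-2012-data": {"n_tracks": 1234, "n_selected": 56},
    "legacy-stripping":         {"n_tracks": 1234, "n_selected": 42},
}

def run_workflow(name):
    # Stand-in for actually launching the legacy application on the
    # current stack; here it returns the expected numbers so the
    # sketch is self-contained and passes.
    return dict(REFERENCES[name])

def validate():
    failures = []
    for name, reference in REFERENCES.items():
        result = run_workflow(name)
        for key, expected in reference.items():
            if result.get(key) != expected:
                failures.append((name, key, expected, result.get(key)))
    return failures

print(validate())  # [] -> every workflow still reproduces its references
```

Run nightly against each new software or CondDB version, a check like this would have flagged the Reco14/Sim09 ROOT incompatibility the day it appeared.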
Infrastructure changes
- Sometimes necessary, sometimes desirable, always long
  - e.g. CMT to CMake
    - Basic tools have been ready for a long time
    - Fully supported by the most recent releases
    - Not transparent to users; requires training and convincing
    - Large body of legacy scripts difficult to migrate
      - e.g. SetupProject for Run 1 versions of Moore
  - e.g. CVS to SVN
    - Huge amount of preparatory work
    - Relatively fast and painless user migration
  - e.g. Job Options
    - Still plenty of .opts files in production releases:

      Lbglimpse opts DaVinci v38r0 | grep ".opts:" | grep -v GAUDI | wc -l
      238

- Supporting two systems for too long
  - How can we go faster?
    - And keep everyone on board?
  - What about old software stacks?
External collaboration
- History of successful examples
  - ROOT, GEANT4, CERNLIB, Gaudi, generators…
  - And less successful ones
    - LCG middleware
  - Sometimes it is not obvious that the pain is worth the gain
- We cannot ignore reality
  - Funding agencies increasingly ask questions about our software sharing
  - We are too few to do everything ourselves
- Development is fun, maintenance less so
  - Including a (long-term) maintenance strategy might make compromises with third parties more attractive
LHCbComputing
Ready to become a (young) adult?