LHCbComputing
Lessons learnt from Run I
Growing pains of a lively teenager
The toddler years
- Gaudi: 1998
  - Object oriented, C++9x, STL, (CERNLIB)
    - NO vector instructions (pre-SSE2)
  - Architecture done from scratch
    - But needed backward compatibility from the start:
      - Existing simulation/reconstruction from the TP (SICB)
      - TDR studies had to be supported throughout
- DIRAC: 2002
  - Supporting major production and user analysis uninterrupted since DC04
- Computing TDR: 2005
  - Computing model designed for a 2 kHz HLT rate
Growing up
- An incremental approach to deployment of new features
  - Adapting to changing requirements and environment
    - E.g. one order of magnitude increase in HLT output rate
  - Learn from the past: throw away what works badly, keep and improve what works well
- Development in parallel with a running production system
  - Physics software in production since 1996
    - Detector design, detector optimisation, HLT + physics preparation, physics exploitation
  - Production system continuously supporting major productions and analysis since 2004
- Strong constraint also for the future
  - Continue to support the running experiment
  - Continue to support analysis of legacy data
    - Minimise the pain of maintenance by supporting legacy data in new software
  - Do not underestimate the training effort needed to keep users on board
Andrei Tsaregorodtsev – Sept 2004
Reconstruction
- Experience from Run 1:
  - Reco14 vs. Reco12 (reprocessing of 2011 data)
    - Significant differences in signal selection
  - Reco14 (first reconstruction of 2012 data):
    - If we can provide calibration "fast enough", we do not need reprocessing
- Run 2 strategy for optimal selection efficiency:
  - Online calibration
    - Use it both online and offline; reprocessing becomes unnecessary
  - Online reconstruction
    - Sufficiently fast to run identical code online and offline
- Given fast enough code and/or sufficient resources in the HLT farm, could skip "offline" reconstruction altogether
  - Opens up new possibilities
    - Reduced need for RAW data offline (c.f. TURBO stream)
    - Optimise reconstruction for HLT farm hardware
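The "identical code online and offline" idea can be illustrated with a minimal sketch: one reconstruction function, parameterised only by the calibration, called from both contexts. All names and numbers here are invented for illustration; this is not LHCb code.

```python
# Sketch: a single reconstruction code path shared by the HLT (online)
# and offline processing. Once the calibration is the same, the outputs
# agree by construction, and offline re-reconstruction becomes optional.
# Function and field names are hypothetical.

def reconstruct(raw_event, calibration):
    """One reconstruction implementation, reused online and offline."""
    tracks = [hit * calibration["alignment"] for hit in raw_event["hits"]]
    return {"tracks": tracks, "calib_tag": calibration["tag"]}

calib = {"tag": "2015-online-v1", "alignment": 1.02}
event = {"hits": [1.0, 2.0, 3.0]}

online = reconstruct(event, calib)    # run in the HLT farm
offline = reconstruct(event, calib)   # run (or skipped) offline

assert online == offline  # identical code + calibration -> identical result
```

The design point is that online/offline agreement stops being a validation problem and becomes a structural property of the software.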
Validation is key
- Long history of code improvements
  - But fixes mostly made as a result of "crises" in production
    - "St. Petersburg crisis" (July 2010)
      - Explosion of CPU time due to combinatorics at high (>2) mu
      - Was not foreseen, had never been properly tested with simulation
      - Led to the introduction of GECs (Global Event Cuts)
    - "Easter crisis" (April 2012)
      - Memory leaks in new code, insufficiently tested at a large enough scale
    - pA reconstruction (February 2013)
      - Events killed by the GECs introduced in 2010…
      - Big effort to understand the tails; bugs found and optimisations made so the cuts could be relaxed
- Better strategy in place for the 2015 startup
  - Turbo-validation stream
    - Allowed online-offline differences to be studied
    - Huge validation effort to understand (and fix) the differences
- But still fire-fighting
  - PbPb / PbAr reconstruction
    - Starting now… experience from 2013 should help
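A global event cut of the kind introduced after the 2010 crisis can be sketched as a cheap multiplicity gate run before the expensive combinatorial reconstruction. The threshold and names below are invented for illustration and are not LHCb's actual cuts.

```python
# Illustrative sketch of a Global Event Cut (GEC): reject events whose
# hit multiplicity is so high that combinatorial tracking would explode
# in CPU time. Threshold and field names are made up.

GEC_MAX_HITS = 10_000

def passes_gec(event):
    """Cheap O(1) gate, evaluated before the expensive reconstruction."""
    return event["n_hits"] <= GEC_MAX_HITS

def process(events):
    kept = [e for e in events if passes_gec(e)]
    # ...expensive combinatorial reconstruction runs only on `kept`...
    return kept

events = [{"n_hits": 500}, {"n_hits": 25_000}, {"n_hits": 9_999}]
print(len(process(events)))  # 2: the 25k-hit event is rejected
```

The 2013 pA episode above shows the flip side: a cut tuned for pp pileup can silently kill valid events in a different collision system, which is why such cuts need continuous validation rather than one-off tuning.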
Validation Challenges
- Clearly we do not do enough validation
  - We don't give ourselves enough time
  - The tools are difficult to use
  - It has little visibility
    - Fire-fighters are more sexy than building safety officers!
- We need to be better organised and to have formal goals
  - The online-offline differences work was a good example
    - Well recognised, accepted metric, with regular reporting
- We have a huge army of "validators": the physics analysts
  - Get them more involved in defining and testing functionality
  - Use current data-taking and analysis to test, commission and deploy new concepts for the future
- Where software has to be written from scratch, be more formal about software quality and validation
  - Needs a cultural change
Event selection (a.k.a. Trigger and Stripping)
- Event selection has moved closer to the detector
  - Computing TDR (2005):
    - 2 kHz HLT rate (of which 200 Hz b-exclusive)
    - 9.6 reduction factor in the Stripping
  - Run 1:
    - 5 kHz HLT rate
    - ~2.0 reduction factor in the Stripping (S21)
  - Run 2 (1866 colliding bunches):
    - 22.6 kHz HLT rate!! (18 kHz FULL, 4 kHz Turbo, 0.6 kHz Turcal)
    - <2.0 reduction in the Stripping
- Rethink the role of the "Stripping"
  - Most of the event rate reduction is now done in the HLT
  - Event size reduction
    - Removes RAW selectively
      - If not needed offline, why do we even write it out of the HLT?
    - Reduces DST size selectively (various MDST streams)
      - TURBO does something similar, already in the HLT…
  - Streaming reduces the number of events to handle in an analysis
    - Could we use an index instead?
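The rate evolution above can be cross-checked with a few lines of arithmetic; all numbers are the ones quoted on the slide, and a reduction factor f simply means 1/f of the events are kept.

```python
# Quick arithmetic on the quoted trigger/stripping rates.

run2_streams_khz = {"FULL": 18.0, "Turbo": 4.0, "Turcal": 0.6}
run2_hlt_khz = sum(run2_streams_khz.values())
print(run2_hlt_khz)  # 22.6, matching the quoted Run 2 HLT output rate

# HLT output rate grew by an order of magnitude vs. the Computing TDR:
tdr_hlt_khz = 2.0
print(run2_hlt_khz / tdr_hlt_khz)  # ~11.3x

# Stripping retention: a reduction factor f keeps 1/f of the events.
tdr_stripping_reduction = 9.6
run1_stripping_reduction = 2.0
print(1 / tdr_stripping_reduction)   # ~0.104 -> ~10% kept in the TDR model
print(1 / run1_stripping_reduction)  # 0.5    -> ~50% kept in Run 1
```

The combination makes the slide's point concrete: the HLT output rate went up by ~11x while the Stripping now removes less than half of what it receives, so almost all selectivity has migrated into the trigger.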
Analysis model
- One size does not fit all
  - DST, MDST, Turbo
  - Compromise between event rate and the ability to redo (parts of) the reconstruction
- What about N-tuples?
  - Completely outside the current computing model, yet they are what analysts mostly access
- Should access to stripping output be reduced, in favour of centralised N-tuple production?
  - Greater role for working-group productions
  - c.f. ALICE analysis trains
- One size does not fit all for simulation either
  - Fast simulation options
Data Popularity, disk copies
- 25% of disk space is occupied by files not accessed in the last year
- Can we be more clever?
  - Optimal number of copies?
  - More active use of tape?
  - Introduce working-group N-tuples into the model?
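One possible answer to "optimal number of copies?" is a popularity-driven replica policy: hot datasets get extra disk copies, cold ones fall back to a single disk copy and tape. The sketch below is a generic illustration, not LHCb's actual policy; all thresholds and names are invented.

```python
# Illustrative popularity-based replica policy (invented thresholds):
# recently and frequently accessed datasets get more disk replicas,
# datasets untouched for a year keep one disk copy and rely on tape.

def target_disk_replicas(days_since_last_access, accesses_last_90d):
    if days_since_last_access > 365:      # cold: not touched in a year
        return 1                          # single disk copy; tape custodial
    if accesses_last_90d > 100:           # hot: heavily used dataset
        return 3
    return 2                              # warm: default replication

print(target_disk_replicas(days_since_last_access=400, accesses_last_90d=0))   # 1
print(target_disk_replicas(days_since_last_access=10, accesses_last_90d=500))  # 3
print(target_disk_replicas(days_since_last_access=30, accesses_last_90d=5))    # 2
```

Applied to the figure above, such a policy would let the 25% of disk holding year-old files shrink toward a single replica instead of occupying full replicated space.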
Software preservation
- We have big difficulties keeping old software operational
  - e.g. Reco14 and Stripping21 cannot run as-is with Sim09
    - Due to ROOT5 / ROOT6 incompatibilities
  - e.g. "swimming" the Run 1 trigger on Reco14 data
    - Due to new classes on Reco14 DSTs, not known to Moore in 2012
- Similar problems with the CondDB
  - Increasingly complex dependencies between:
    - Software versions (new functionality, new calibration methods)
    - Real detector geometry changes
    - Different calibration or simulation versions
- Need much improved continuous validation of important workflows
  - To catch backward incompatibilities when they happen
  - To make the introduction of new concepts less error-prone
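Such continuous validation can be as simple as a scheduled job that re-runs frozen legacy workflows against the current software stack and compares a few summary numbers to stored references. The sketch below is generic; workflow names and reference values are invented, and `run_workflow` stands in for launching the real legacy application.

```python
# Minimal continuous-validation sketch: re-run legacy workflows and
# compare summary numbers against frozen references, so backward
# incompatibilities are caught when introduced, not years later.
# Workflow names and reference values are invented for illustration.

REFERENCES = {
    "legacy-reco-on-2012-data": {"n_tracks": 1234, "n_selected": 56},
    "legacy-stripping":         {"n_tracks": 1234, "n_selected": 42},
}

def run_workflow(name):
    # Stand-in for actually launching the legacy application on the
    # current stack; here it returns the expected numbers so the
    # sketch is self-contained and passes.
    return dict(REFERENCES[name])

def validate():
    failures = []
    for name, reference in REFERENCES.items():
        result = run_workflow(name)
        for key, expected in reference.items():
            if result.get(key) != expected:
                failures.append((name, key, expected, result.get(key)))
    return failures

print(validate())  # [] -> every workflow still reproduces its references
```

Run nightly against each new software or CondDB version, a check like this would have flagged the Reco14/Sim09 ROOT incompatibility the day it appeared.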
Infrastructure changes
- Sometimes necessary, sometimes desirable, always long
  - e.g. CMT to CMake
    - Basic tools have been ready for a long time
    - Fully supported by the most recent releases
    - Not transparent to users; requires training and convincing
    - Large body of legacy scripts difficult to migrate
      - e.g. SetupProject for Run 1 versions of Moore
  - e.g. CVS to SVN
    - Huge amount of preparatory work
    - Relatively fast and painless user migration
  - e.g. Job Options
    - Still plenty of .opts files in production releases:

      Lbglimpse opts DaVinci v38r0 | grep ".opts:" | grep -v GAUDI | wc -l
      238

- Supporting two systems for too long
  - How can we go faster?
    - And keep everyone on board?
  - What about old software stacks?
External collaboration
- History of successful examples
  - ROOT, GEANT4, CERNLIB, Gaudi, generators…
  - And less successful ones
    - LCG middleware
  - Sometimes it is not obvious that the pain is worth the gain
- We cannot ignore reality
  - Funding agencies increasingly ask questions about our software sharing
  - We are too few to do everything ourselves
- Development is fun, maintenance less so
  - Including a (long-term) maintenance strategy might make compromises with third parties more attractive
LHCbComputing
Ready to become a (young) adult?