TRANSCRIPT
Thomas Jefferson National Accelerator Facility
Page 1
12 GeV Upgrade Software Review
Jefferson Lab
November 25-26, 2013
Software Project for Hall B 12 GeV Upgrade
Status and Plans
D. P. Weygand
Thomas Jefferson National Accelerator Facility
Page 2
Overview
• CLAS12 Software Overview
• Advances made in: 1a
  – Framework
  – Simulations
  – Tracking
  – Calibration and Monitoring
  – Event Reconstruction
  – Data Processing
• Timelines and Milestones 1a/b
  – Milestones met and current timeline
• Usability 1c
  – Of framework (Data Mining Project), simulation (detector studies), and reconstruction (detector performance analysis)
• Software Profiling and Documentation 1d
• Steps taken to address recommendations from previous review 2a
• Risk Mitigation 2d
• Summary
Thomas Jefferson National Accelerator Facility
Page 3
CLAS12 Offline Software: Components and Descriptions
GEMC: Full Geant4-based detector simulation
ced: Event-level visualization
CLaRA: SOA-based physics data processing application development framework
DPE: Physics Data Processing Environment in Java, C++, and Python
Service Containers: Multi-threaded support
Application Orchestrators: Cloud/batch-farm support
Online Level-3 Data Processing: ClaRA-based application deployment and operation on the online farm
Reconstruction Services
  Tracking: Charged-particle track reconstruction
  FTOF/CTOF: Time-of-flight reconstruction for PID
  EC/PCAL: Neutral particle identification
  LTCC/HTCC: K/pi separation
  Forward Tagger: Quasi-real photon tagger
  PID: Particle identification
  Event Builder: Entire event reconstruction
Calibration and Monitoring Services: Detector calibration, monitoring of data processing, histogramming
Auxiliary Services: Geometry service, magnetic field service
Calibration and Conditions Database: Calibration and conditions database for on/offline constants
Data Analysis
  DST: Data summary tapes, data format for analysis
  Data Mining: Distributed data access
Thomas Jefferson National Accelerator Facility
Page 4
Computing Model and Architecture
• ClaRA (CLAS12 Reconstruction & Analysis framework) is a multi-threaded analysis framework based on a Service-Oriented Architecture
• Physics application design/composition based on services
• Services being developed 1a
  – Charged-particle tracking (central, forward)
  – EC reconstruction
  – TOF reconstruction
  – PID
  – HTCC
  – PCAL
  – Detector calibration
  – Event Builder
  – Histogram services
  – Database application
• Multilingual support 2b
  – Services can be written in C++, Java, and Python
• Supports both traditional and cloud computing models 2f
• Single-process as well as distributed application design modes
• Centralized batch processing
• Distributed cloud processing
Thomas Jefferson National Accelerator Facility
Page 5
Clas12 Event Reconstruction
[Diagram: CLAS12 event reconstruction service chain linking the Reader (R), Central and Forward Tracking (CT, FT) with Kalman Filters (KF), EC, PCAL, TOF, HTCC, LTCC, PID, the Event Builder (EB), and the Writer (W)]
R/W: Reader/Writer; CT: Central Tracking; FT: Forward Tracking; KF: Kalman Filter; EC: Electromagnetic Calorimeter; PCAL: Preshower Calorimeter; TOF: Forward & Central Time-of-Flight; HTCC/LTCC: Threshold Cerenkov; EB: Event Builder; PID: Particle ID
Thomas Jefferson National Accelerator Facility
Page 6
Stress Tests
• Tests using a single data stream show that online analysis can process ~10% of the data stream (2 kHz)
• Tests using the multiple data-stream application show that the rate scales with the number of processing nodes (20 nodes used)
1b 2a 2f
[Plot: Data Processing Rate, single data stream. Processing rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading) for Ethernet, Infiniband, and RAM Disk IO; linear fit to RAM Disk IO: f(x) = 0.193x + 0.114, R² = 0.990]
Thomas Jefferson National Accelerator Facility
Page 7
Multiple Data-stream Application
[Diagram: multiple data-stream deployment: the ClaRA Master DPE with Administrative Services and an Application Orchestrator (AO) on the Executive Node; each farm node runs a reader (R), reconstruction services S1...Sn, and a writer (W) against persistent storage, fed by data streams (DS)]
Thomas Jefferson National Accelerator Facility
Page 8
[Plot: Data Processing Rate, multiple data streams. Processing rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading), one data file of 10K events per processing node; linear fit: f(x) = 0.2x, R² = 1]
Multiple Data-stream Application
Clas12 Reconstruction: JLAB batch farm
Thomas Jefferson National Accelerator Facility
Page 9
ClaRA Batch Processing WorkFlow
DPE: Data Processing Environment; SM: Shared Memory; Dm: Data Manager; Tm: Tape Manager; I/O: I/O Services; S: Reconstruction Services
2f 3a/b
Thomas Jefferson National Accelerator Facility
Page 10
Jlab Workflow System
[Diagram: the Workflow System receives workflow scripts (script A with input B; script C with input D), issues data requests and receives status reports, submits DPE jobs through Auger/PBS, and passes commands and inputs (COMD: C/E, INPUT: D/F) to the CLARA orchestrator for workflows A and C]
1d
Thomas Jefferson National Accelerator Facility
Page 11
Advances within the Framework
• Batch Farm processing mode
  – Currently being integrated into large-scale workflow 2f 1d
• Service-based data-flow control application
• ClaRA Application Designer 1c
  – Graphical service composition
• Transient data streaming optimization: EvIO 4.1 is the default event format for CLAS12 (see the byte-buffer sketch below)
  o Data is just a byte buffer (avoids serialization)
  o Complete API (Java, C++, Python) to get data from the buffer
  o A set of wrappers to work with the common CLAS12 bank format
• Development of EvIO (persistent ↔ transient) data converter services (e.g., EvIO to ROOT)
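To make the transient-data point concrete, here is a minimal sketch of treating event data as a raw byte buffer so that services avoid object serialization. The bank layout below (a tag, a word count, and an integer payload) is hypothetical and is not the actual EvIO 4.1 format or the CLAS12 bank wrappers; only java.nio is used.

// Illustrative sketch: transient event data treated as a raw byte buffer,
// so services can pass it around without serializing/deserializing objects.
// The bank layout (tag, length, payload of ints) is hypothetical and NOT the
// actual EvIO 4.1 format; it only demonstrates the byte-buffer idea.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteBufferBankDemo {
    public static void main(String[] args) {
        // "Write" a tiny bank: tag, number of values, then the values.
        ByteBuffer event = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        event.putInt(0x0FAD);        // hypothetical bank tag
        event.putInt(3);             // number of payload words
        event.putInt(101).putInt(202).putInt(303);
        event.flip();                // make the buffer readable

        // A downstream "service" reads straight from the same buffer.
        int tag = event.getInt();
        int n   = event.getInt();
        System.out.printf("bank tag=0x%04X with %d words:%n", tag, n);
        for (int i = 0; i < n; i++) {
            System.out.println("  word " + i + " = " + event.getInt());
        }
    }
}

In the real chain the same idea applies: services read and append banks directly in the shared buffer instead of converting to and from language-level objects.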
Thomas Jefferson National Accelerator Facility
Page 12
GEMC 2.0 GEANT 4 MONTE CARLO
➡ Automatic "true info", V(t) signal, digitization
➡ FADC ready
➡ New banks, bank IO, automatic ROOT
➡ Geometry based on the Geometry Service
➡ GEMC App: simplified installation
GEMC 2.0 1c
Thomas Jefferson National Accelerator Facility
Page 13
Introducing factories of factories
Geometry factory types: MYSQL, TEXT, GDML, CLARA (service library plugin)
<detector name="DC12" factory="MYSQL" variation="mestayer" run_number="23"/>
<detector name="LTCC" factory="TEXT" variation="rotated5deg" run_number="100"/>
<detector name="EC" factory="CLARA" variation="original" run_number="1"/>
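A minimal sketch of the "factory of factories" idea behind the factory attribute above: a registry maps each factory name (MYSQL, TEXT, GDML, CLARA) to a geometry provider, and the XML selects one per detector. The interface and lambdas below are hypothetical placeholders; GEMC's actual implementation is in C++ and differs in detail.

// Illustrative sketch of a "factory of factories": a registry maps the
// factory name from the XML (MYSQL, TEXT, GDML, CLARA) to a geometry
// provider. Names and interfaces here are hypothetical placeholders.
import java.util.HashMap;
import java.util.Map;

interface GeometryFactory {
    String build(String detector, String variation, int runNumber);
}

public class FactoryOfFactories {
    private static final Map<String, GeometryFactory> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("TEXT",  (d, v, r) -> "geometry for " + d + " from text files, variation " + v);
        REGISTRY.put("MYSQL", (d, v, r) -> "geometry for " + d + " from MySQL, run " + r);
        // GDML and CLARA factories would be registered the same way.
    }

    public static void main(String[] args) {
        // Equivalent of <detector name="DC12" factory="MYSQL" variation="mestayer" run_number="23"/>
        GeometryFactory f = REGISTRY.get("MYSQL");
        System.out.println(f.build("DC12", "mestayer", 23));
    }
}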
Thomas Jefferson National Accelerator Facility
Page 14
GEMC Voltage Signal
Each step produces a V signal based on DB parameters
All signals are summed into a final V(t) shape
Negligible effect on performance
Example of a 2.8 GeV/c p producing a digitized signal (FTOF 1a & 1b)
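A minimal sketch of the summation described above: each simulated step contributes a parameterized pulse, and the pulses are added sample by sample into V(t). The triangular pulse shape, time window, and amplitudes are hypothetical; GEMC's actual digitization is driven by DB parameters inside the C++/Geant4 application.

// Illustrative sketch: sum per-step pulses into a sampled V(t) waveform.
// The triangular pulse shape, time window, and amplitudes are hypothetical.
public class VoltageSumDemo {
    static final double DT_NS = 1.0;          // sample spacing (ns)
    static final int    NSAMPLES = 200;       // 200 ns window

    // Hypothetical single-step pulse: linear rise then fall, scaled by energy deposit.
    static void addPulse(double[] vt, double t0Ns, double edepMeV) {
        double rise = 5.0, fall = 20.0;       // hypothetical shape parameters (ns)
        for (int i = 0; i < NSAMPLES; i++) {
            double t = i * DT_NS - t0Ns;
            if (t >= 0 && t < rise) {
                vt[i] += edepMeV * (t / rise);
            } else if (t >= rise && t < rise + fall) {
                vt[i] += edepMeV * (1.0 - (t - rise) / fall);
            }
        }
    }

    public static void main(String[] args) {
        double[] vt = new double[NSAMPLES];
        addPulse(vt, 12.0, 1.5);               // step 1: 1.5 MeV at t = 12 ns
        addPulse(vt, 15.0, 0.8);               // step 2: 0.8 MeV at t = 15 ns
        for (int i = 10; i < 40; i += 5) {
            System.out.printf("t = %3.0f ns  V = %.3f%n", i * DT_NS, vt[i]);
        }
    }
}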
Thomas Jefferson National Accelerator Facility
Page 15
a) Generation 3 tracking (TRAC)
   i. Written as a Java service within ClaRA
   ii. New design, algorithms & improved efficiency
   iii. Ongoing code validation processes
   iv. Used to analyze cosmic & test-stand data & validate detector design changes
b) New systems included in the reconstruction chain
   i. CTOF & FTOF reconstruction written as Java services within ClaRA
   ii. FTCal reconstruction written as a Java service (ongoing standalone development)
c) Geometry Service
   i. TOF & DC get geometry constants from the Geometry Service (other detector systems to be included soon)
   ii. Simulation gets geometry constants from the Geometry Service
d) Event Builder
   i. Links outputs of services connected together in the reconstruction application
   ii. Output bank structure designed for output to ROOT
Advances in Analysis Software
1a
Thomas Jefferson National Accelerator Facility
Page 16
1. Advances in Monitoring
   a) Event Displays
      i. Displays output of reconstruction as events are being processed (control-room displays for detector monitoring)
      ii. Does statistical histogramming of reconstruction output
   b) Event histogramming services
      i. Occupancy plots
      ii. Detector noise analysis, etc.
   c) Service & Application Monitoring
      i. Error handling
      ii. Incident monitoring
2. Advances in Calibration
   a) Calibration Development
      i. TOF & EC systems (using ROOT input)
      ii. DC will use legacy algorithms and ROOT input
Advances in Monitoring & Calibration
1a
Thomas Jefferson National Accelerator Facility
Page 19
Software Profiling and Verification
Hot Spot Analysis
Thomas Jefferson National Accelerator Facility
Page 20
Software Profiling and Verification
[Diagram: verification cross-checks between the TRAC swimmer and GEMC tracking: tracking studies and magnetic field (Bx, By, Bz) studies]
Thomas Jefferson National Accelerator Facility
Page 21
Documentation/Workshops
• 3 Software Workshops
• September 2012
  – What Will Be Covered
    • The CLARA service-oriented framework
    • Service design and implementation
    • EVIO transient data/event format
• October 2012
  – We will walk through the steps needed to do the following:
    • Run gemc on the CUE machines to get a digitized version of the generated events
    • Set up and run reconstruction code on a personal computer/laptop (Java required) or on the CUE machines
    • Visualize and perform some simple analysis on the output data
Thomas Jefferson National Accelerator Facility
Page 23
CLAS12 Data Distribution/Workflow Tool
• Tagged File System (see the sketch after this list):
  • A tagged file system is needed to sort run files according to run conditions, targets, beam type, and energy.
  • Ability to add metadata and run properties (such as beam current and total charge) to the run collections.
• CLARA Distributed Environment:
  • Grid-alternative computing environment for accessing data from anywhere and for pre-processing data to decrease network traffic.
  • Multi-node Data Processing Environments (DPEs) for local-network distributed computing and data synchronization.
• Data Mining Analysis Framework:
  • Distributed experimental data framework, with search and discovery using TagFS.
  • Analysis codes for specific experimental data sets, including particle identification, momentum corrections, and fiducial cuts.
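As referenced above, here is a minimal sketch of tag-based run selection, with hypothetical tag names and an in-memory map standing in for the TagFS metadata store.

// Illustrative sketch: select run files by tags (run conditions, target,
// beam type/energy). The tag names and the in-memory map are hypothetical
// placeholders for the actual TagFS metadata store.
import java.util.List;
import java.util.Map;

public class TaggedRunSelectionDemo {
    record RunFile(String path, Map<String, String> tags) {}

    public static void main(String[] args) {
        List<RunFile> runs = List.of(
            new RunFile("run0101.evio", Map.of("target", "LH2", "beam_energy", "11 GeV", "beam_current", "75 nA")),
            new RunFile("run0102.evio", Map.of("target", "LD2", "beam_energy", "11 GeV", "beam_current", "50 nA")),
            new RunFile("run0103.evio", Map.of("target", "LH2", "beam_energy", "6.6 GeV", "beam_current", "75 nA")));

        // "Query": all 11 GeV runs on the LH2 target.
        runs.stream()
            .filter(r -> "LH2".equals(r.tags().get("target")))
            .filter(r -> "11 GeV".equals(r.tags().get("beam_energy")))
            .forEach(r -> System.out.println(r.path()));
    }
}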
Thomas Jefferson National Accelerator Facility
Page 24
Management
• Software workshops before scheduled CLAS collaboration meetings
• Weekly software meetings (video-conferencing)
• Mantis (bug reporting)
• The ClaRA framework supports multiple languages, to accommodate/encourage user contributions.
• The Calibration and Commissioning committee is a collaboration-wide body that assumes responsibility for overseeing CLAS12 software and computing activities.
• Software upgrades/modifications as well as bug fixes are discussed using Mantis and the e-mail list.
• Internal JLab reviews (e.g., tracking algorithm discussions with the Hall D group)
• Milestone changes to address critical issues, e.g.:
  • Data transfer through shared memory
  • Minimizing EvIO serialization/deserialization
2b-f
Thomas Jefferson National Accelerator Facility
Page 25
Addressing Previous Recommendations
• Stress tests
• Linear scaling with the number of cores
• 50 node test in progress
• Usability (see break-out sessions)
• The Data Mining project uses ClaRA to ship data and run analyses at universities all over the world
• Simulation is well advanced and used in proposals
• Generation-3 tracking rebuilt and beginning to be used by detector groups
• EvIO to ROOT converter C++ service development
2a
Thomas Jefferson National Accelerator Facility
Page 26
Addressing Previous Recommendations
Recommendation: A series of scaling tests ramping up using the LQCD farm should be planned and undertaken.
Response: A series of tests was run on the current batch farm (up to 32 hyper-threaded cores) to confirm ClaRA scaling and system robustness. Currently ramping up to … cores. Full stress test planned for ….
Recommendation: Seriously consider using ROOT as the file format in order to make use of the steady advances in its I/O capabilities.
Response: Considered. A ROOT data converter is being developed, particularly for calibration services. That is, persistent data remains EVIO, but ROOT is an available file format.
Recommendation: The costs and sustainability of supporting two languages, relative to the advantages, should be regularly assessed as the community of users grows, code development practices become clearer, the framework matures further, etc.
Response: Service language was chosen based on requirements. In fact, a third language, Python, was added specifically for the PWA analysis service (the SciPy fitter is faster). Multilingual support has increased the availability of programmers, e.g., for ROOT-based calibration services. The Geometry Service needed to be written in C++ for GEMC compatibility.
2a
Thomas Jefferson National Accelerator Facility
Page 27
Risks and Mitigation
• Communication latency
  • Resolved by introducing inter-node deployment of services with shared memory and data caching in memory
• Broad author and user pools
  • Proper management and administration; strict service canonization rules
• Workloads of different clients may introduce "pileups" on a single service
  • Service and cloud governance (e.g. service locking)
• Network security
  • Client authentication and message encryption
• Limited manpower
  • Interfaces (C++/Java) provide access to CLAS legacy code
  • ROOT data interface broadens the programmer base for calibration code
2d
Thomas Jefferson National Accelerator Facility
Page 28
Summary
• 1a) Is Hall B making appropriate progress in developing simulation, calibration and analysis software?
• Yes. Simulation is in an advanced state since it was needed to validate detector design and performance. All detector subsystems are modeled, background simulation is realistic, geometry is aligned with reconstruction through a common service interface, and, finally, the package is easy to use.
• Calibration is at the advanced design stage, appropriate since it is the last element needed in the overall software chain. Hall B has an advantageous situation, in that the detector subsystems are well-understood by the subsystem managers, being very similar or in some cases, identical, to systems used in the previous CLAS detector.
• Analysis software has been designed from the bottom up, and the event reconstruction software written and tested for major subsystems: time of flight, calorimetry and charged particle tracking. Close cooperation among the core group of developers has produced a well-designed framework with a similar "look and feel" between the different systems, which should ease the tasks of debugging, maintenance and improvements over the years. Higher level analysis (event selection and binning, fiducial cuts, kinematic fitting, invariant mass reconstruction, etc.) has only just begun, but the core group are providing convenient tools for collaborative effort as demonstrated by some of the outside groups.
Thomas Jefferson National Accelerator Facility
Page 29
Summary cont
• Meeting previous milestones?
• Yes. In a few cases we have re-prioritized effort (for example, placing more emphasis on basics such as the event format, object-model definition, and production of the core geometry and database services, while delaying the detailed instantiation of the individual detector calibrations, which will be the last step in fine-tuned event reconstruction).
• Are the milestones adequate and clearly-defined?• Yes
• Is Hall B effectively utilizing collaboration manpower?• The majority of work in the past year (framework development, and writing of core
services) has been done largely by the core group. However, some of that core group are located remotely, demonstrating that this is not a hindrance to close collaboration. In addition, the software team has made a significant effort to engage the collaboration by holding a number of "hands-on" workshops, and by encouraging subsystem software groups to build their calibration GUI's on a ROOT-based framework. This should provide a sizeable group of people to work on the details of calibration over the next two years.
• Collaboration• CCDB shared between Halls B&D, EVIO developed by DAQ, used as persistent and
transient data, Farm Workflow, developed by Scientific Computing, in collaboration with Hall B, GEMC (B&D), Event Display (B&D), RootSpy (B&D), Tracking & Reconstruction algorithms (B&D)
Thomas Jefferson National Accelerator Facility
Page 31
Summary
ClaRA has advanced considerably in the past year through actual deployment
GEMC integrated with the geometry database
Several interactive workshops held on both ClaRA service development and ClaRA deployment to introduce the environment to the collaboration
ClaRA deployments and reconstruction chains implemented on a variety of collaboration farms.
Steady development of requisite services, in particular generation III tracking with Kalman Filter
Initial work on some calibration and monitoring services
Initial work on Histogramming/Statistical services
Initial work on service profiling and verification
Thomas Jefferson National Accelerator Facility
Page 34
ClaRA Components
Platform (Cloud Control Node)
• Service Bus (pub-sub server)
• Registration (service registration)
• Discovery (service discovery)
• Administration (keeps an inventory of all running DPEs and deployed services; handles service deployment, removal, and recovery)
• Governing (provides information on service availability and distribution)
• Monitoring

DPE (Computing Node)
• Each node acts as a DPE.
• All services are deployed and executed by threads inside the DPE process.
• Global memory to share data between services.

Orchestrator
• Designs and controls ClaRA applications.
• Coordinates service execution and data flow.
• Usually runs outside of the DPE.
• Deploys services to DPEs.
• Links services together: the output of a service is sent as input to its linked service.
Thomas Jefferson National Accelerator Facility
Page 35
Service Container
• Groups and manages services in a DPE.
• Can be used as a namespace to separate services.
  o The same service engine can be deployed in different containers in the same DPE.
• Handles service execution and its output.
• The service container presents a user engine as an SOA service (SaaS implementation).
[Diagram: a service engine wrapped by the engine interface and message-processing layers]

Service Engine
• The fundamental unit of a ClaRA-based application.
• Receives input data in an envelope and generates output data.
  o The data envelope is the same for all services.
• Implements the ClaRA standard interface:
  o a configure method
  o an execute method
  o several description/identification methods
• Must be thread-safe.
  o The same service engine can be executed in parallel multiple times.
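The configure/execute contract just described can be illustrated with a minimal sketch. The interface and envelope types below are hypothetical stand-ins (the actual ClaRA API names may differ); the point is a one-time configure call, a per-event execute that may run concurrently, and thread-safe state.

// Illustrative sketch only: the interface and envelope types below are
// hypothetical stand-ins for ClaRA's actual API, shown to convey the
// configure/execute contract and the thread-safety requirement.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface DataEnvelope {                 // hypothetical: common envelope for all services
    Object payload();
}

interface ServiceEngine {                // hypothetical stand-in for the ClaRA engine interface
    void configure(Map<String, String> config);   // called once before processing
    DataEnvelope execute(DataEnvelope input);      // called per event, possibly in parallel
    String description();
}

// A trivial engine that counts events; all mutable state is thread-safe
// because the same engine instance may be executed concurrently.
class EventCounterEngine implements ServiceEngine {
    private final ConcurrentHashMap<String, Long> counters = new ConcurrentHashMap<>();

    @Override public void configure(Map<String, String> config) { /* read calibration constants, etc. */ }

    @Override public DataEnvelope execute(DataEnvelope input) {
        counters.merge("events", 1L, Long::sum);   // thread-safe update
        return input;                               // pass the event on unchanged
    }

    @Override public String description() { return "Counts processed events"; }
}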
Thomas Jefferson National Accelerator Facility
Page 36
Service Communication
[Diagram: service communication: on a computing node, services in the Java DPE and the C++ DPE exchange data through transient data storage and the service bus; the service buses of computing nodes 1, 2, ..., N are interconnected]
Thomas Jefferson National Accelerator Facility
Page 38
Single Data-stream Application
[Diagram: single data-stream deployment: the ClaRA Master DPE with Administrative Services, an Application Orchestrator (AO), reader (R), and writer (W) on the Executive Node attached to persistent storage; farm nodes run the reconstruction services S1...Sn]
Thomas Jefferson National Accelerator Facility
Page 39
Multiple Data-stream Application
[Diagram: multiple data-stream deployment: the ClaRA Master DPE with Administrative Services and an Application Orchestrator (AO) on the Executive Node; each farm node runs a reader (R), reconstruction services S1...Sn, and a writer (W) against persistent storage, fed by data streams (DS)]
Thomas Jefferson National Accelerator Facility
Page 41
Computing Model
[Diagram: CLAS12 computing model. The CLAS12 detector (electronics, trigger, slow controls) feeds ET/online transient data storage and permanent data storage. The online farm, under an Online Application Orchestrator with cloud control, service registration, and service control, runs online EB, online monitoring, event visualization, and online calibration services, backed by the calibration and conditions databases and by geometry/calibration and run-conditions services. The offline JLAB farm and offline university clouds 1...n, coordinated by a cloud scheduler and a Physics Data Processing Application Orchestrator, run Geant4/GEMC simulation, EB, DST/histogram/visualization/analysis, and calibration services against permanent data storage and the calibration and conditions databases.]
Thomas Jefferson National Accelerator Facility
Page 42
• Read EVIO events from the input file.
• Events pass from service to service in the chain.
  o Services add more banks to the event.
• Write events to the output file.
R → S1 → S2 → ... → SN → W
Single Event Reconstruction
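A minimal sketch of the reader-to-services-to-writer loop described above. The event and service types are hypothetical stand-ins (lists of bank names and simple functions); in ClaRA the chain is composed from configuration rather than hand-coded like this.

// Illustrative sketch of the single-event chain: a reader produces events,
// each service in the chain appends a "bank", and a writer consumes the
// result. Types here are hypothetical stand-ins for the ClaRA/EvIO ones.
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class SingleEventChainDemo {
    public static void main(String[] args) {
        // Each "service" just adds a named bank to the event.
        List<UnaryOperator<List<String>>> services = List.of(
            ev -> { ev.add("DC tracking bank"); return ev; },   // S1
            ev -> { ev.add("FTOF bank");        return ev; },   // S2
            ev -> { ev.add("EB summary bank");  return ev; });  // SN

        for (int evt = 0; evt < 3; evt++) {                     // reader: 3 fake events
            List<String> event = new ArrayList<>(List.of("raw banks for event " + evt));
            for (UnaryOperator<List<String>> s : services) event = s.apply(event);
            System.out.println(event);                           // writer
        }
    }
}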
Thomas Jefferson National Accelerator Facility
Page 43
Multi-Core Reconstruction
[Diagram: multi-core reconstruction within a single DPE with an orchestrator (O): one reader (R) feeds several parallel service chains S1 → S2 → ... → SN that share one writer (W)]
Thomas Jefferson National Accelerator Facility
Page 47
Multi-Node Reconstruction
[Diagram: multi-node reconstruction: the reader (R) and writer (W) run on an I/O DPE (DPEio); orchestrators (DO, MO) coordinate DPE1...DPEn, each running several parallel service chains S1 → S2 → ... → SN]
Thomas Jefferson National Accelerator Facility
Page 50
Single Data-stream Application
Clas12 Reconstruction: JLAB batch farm
[Plot: Data Processing Rate, single data stream. Processing rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading) for Ethernet, Infiniband, and RAM Disk IO; linear fit to RAM Disk IO: f(x) = 0.193x + 0.114, R² = 0.990]
Thomas Jefferson National Accelerator Facility
Page 51
[Plot: Data Processing Rate, multiple data streams. Processing rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading), one data file of 10K events per processing node; linear fit: f(x) = 0.2x, R² = 1]
Multiple Data-stream Application
Clas12 Reconstruction: JLAB batch farm
Thomas Jefferson National Accelerator Facility
Page 53
Single Data-stream Application
[Plot: Data Processing Rate, single data stream. Processing rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading) for Ethernet, Infiniband, and RAM Disk IO; linear fit to RAM Disk IO: f(x) = 0.193x + 0.114, R² = 0.990]