
Thomas Jefferson National Accelerator Facility

Page 1

12 GeV Upgrade Software Review, Jefferson Lab

November 25-26, 2013

Software Project for Hall B 12 GeV Upgrade

Status and Plans

D. P. Weygand

Thomas Jefferson National Accelerator Facility

Page 2

Overview

• CLAS12 Software Overview
• Advances made in: 1a
  – Framework
  – Simulations
  – Tracking
  – Calibration and Monitoring
  – Event Reconstruction
  – Data Processing
• Timelines and Milestones 1a/b
  – Milestones met and current timeline
• Usability 1c
  – Of the framework (Data Mining Project), simulation (detector studies), and reconstruction (detector performance analysis)
• Software Profiling and Documentation 1d
• Steps taken to address recommendations from the previous review 2a
• Risk Mitigation 2d
• Summary

Thomas Jefferson National Accelerator Facility

Page 3

CLAS12 Offline Software: Components and Descriptions

• GEMC: Full GEANT4-based detector simulation
• ced: Event-level visualization
• CLaRA: SOA-based Physics Data Processing application development framework
• DPE: Physics Data Processing Environment in Java, C++, and Python
• Service Containers: Multi-threaded support
• Application Orchestrators: Cloud/batch-farm support
• Online Level-3 Data Processing: ClaRA-based application deployment and operation on the online farm
• Reconstruction Services
  – Tracking: Charged-particle track reconstruction
  – FTOF/CTOF: Time-of-flight reconstruction for PID
  – EC/PCAL: Neutral-particle identification
  – LTCC/HTCC: K/pi separation
  – Forward Tagger: Quasi-real photon tagger
  – PID: Particle identification
  – Event Builder: Entire-event reconstruction
• Calibration and Monitoring Services: Detector calibration, monitoring of data processing, histogramming
• Auxiliary Services: Geometry service, magnetic field service
• Calibration and Conditions Database: Calibration and conditions database for on/offline constants
• Data Analysis
  – DST: Data summary tapes, data format for analysis
  – Data Mining: Distributed data access

Thomas Jefferson National Accelerator Facility

Page 4

Computing Model and Architecture

• ClaRA (CLAS12 Reconstruction and Analysis framework) is a multi-threaded analysis framework based on a Service-Oriented Architecture
• Physics application design/composition is based on services
• Services being developed 1a
  – Charged-particle tracking (central, forward)
  – EC reconstruction
  – TOF reconstruction
  – PID
  – HTCC
  – PCAL
  – Detector calibration
  – Event Builder
  – Histogram services
  – Database application
• Multilingual support 2b
  – Services can be written in C++, Java, and Python
• Supports both traditional and cloud computing models 2f
  – Single-process as well as distributed application design modes
  – Centralized batch processing
  – Distributed cloud processing

Thomas Jefferson National Accelerator Facility

Page 5

CLAS12 Event Reconstruction

[Diagram: event reconstruction service chain connecting R, CT, FT (with Kalman Filter), TOF, EC, PCAL, HTCC, LTCC, EB, PID, and W.]

Legend:
R/W: Reader/Writer
CT: Central Tracking
FT: Forward Tracking
KF: Kalman Filter
EC: Electromagnetic Calorimeter
PCAL: Preshower Calorimeter
TOF: Forward & Central Time-of-Flight
HTCC/LTCC: Threshold Cerenkov counters
EB: Event Builder
PID: Particle ID

Thomas Jefferson National Accelerator Facility

Page 6

Stress Tests

• Tests using a single data stream show that online analysis can process ~10% of the data stream (2 kHz).
• Tests using the multiple data-stream application show that processing scales with the number of processing nodes (20 nodes used).

1b 2a 2f

[Plot: Data Processing Rate, single data stream. Rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading), for Ethernet, Infiniband, and RAM Disk I/O; linear fit to RAM Disk I/O: f(x) = 0.193x + 0.114, R² = 0.990.]

Thomas Jefferson National Accelerator Facility

Page 7

Multiple Data-stream Application

[Diagram: multiple data-stream application. Reader/writer (R/W) and Administrative Services on the ClaRA Master DPE (Executive Node), an application orchestrator (AO), data streams (DS), services S1...Sn on Farm Nodes 1...N, and persistent storage.]

Thomas Jefferson National Accelerator Facility

Page 8

Multiple Data-stream Application
CLAS12 Reconstruction: JLab batch farm

[Plot: Data Processing Rate, multiple data streams. Rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading); one data file of 10K events per processing node; linear fit: f(x) = 0.2x, R² = 1.]

Thomas Jefferson National Accelerator Facility

Page 9

ClaRA Batch Processing Workflow

Legend:
DPE: Data Processing Environment
SM: Shared Memory
Dm: Data Manager
Tm: Tape Manager
I/O: I/O Services
S: Reconstruction Services

2f 3a/b

Thomas Jefferson National Accelerator Facility

Page 10

JLab Workflow System

[Diagram: the JLab workflow system receives a data request, runs workflows (scripts/commands with their inputs, e.g. Workflow A and Workflow C), submits DPE jobs through Auger/PBS, launches a CLARA orchestrator, and returns status reports.]

1d

Thomas Jefferson National Accelerator Facility

Page 11

Advances within the Framework

• Batch Farm processing mode
  – Currently being integrated into the large-scale workflow 2f 1d
• Service-based data-flow control application
• ClaRA Application Designer 1c
  – Graphical service composition
• Transient data streaming optimization
  – EvIO 4.1 is the default event format for CLAS12 (an illustrative byte-buffer sketch follows this list).
    o Data is just a byte buffer (avoids serialization).
    o Complete API (Java, C++, Python) to get data from the buffer.
    o A set of wrappers to work with the common CLAS12 bank format.
• Development of EvIO data converter services (persistent-transient ↔ transient-persistent), e.g. EvIO to ROOT
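A minimal sketch of the "data is just a byte buffer" idea, assuming plain Java NIO rather than the actual EvIO 4.1 API (the class and method names below are hypothetical): services share the same bytes and read words directly out of the buffer, so no Java object serialization is involved.

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Illustrative sketch only: transient event data passed between services
    // as a raw byte buffer. NOT the EvIO 4.1 API.
    public class TransientEventSketch {

        // Pack a few "bank" words into a buffer the way a writer might.
        static ByteBuffer packEvent(int eventNumber, int[] bankWords) {
            ByteBuffer buf = ByteBuffer.allocate(4 * (2 + bankWords.length))
                                       .order(ByteOrder.BIG_ENDIAN);
            buf.putInt(eventNumber);      // simple header word (hypothetical layout)
            buf.putInt(bankWords.length); // payload length in words
            for (int w : bankWords) {
                buf.putInt(w);
            }
            buf.flip();                   // ready for readers
            return buf;
        }

        // A downstream service reads directly from the buffer; no deserialization step.
        static void printEvent(ByteBuffer buf) {
            ByteBuffer view = buf.asReadOnlyBuffer(); // services share the same bytes
            int eventNumber = view.getInt();
            int length = view.getInt();
            System.out.printf("event %d, %d payload words%n", eventNumber, length);
            for (int i = 0; i < length; i++) {
                System.out.printf("  word[%d] = 0x%08X%n", i, view.getInt());
            }
        }

        public static void main(String[] args) {
            ByteBuffer event = packEvent(42, new int[]{0xCAFE, 0xBEEF});
            printEvent(event);
        }
    }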

Thomas Jefferson National Accelerator Facility

Page 12

GEMC 2.0: GEANT4 Monte Carlo 1c

➡ Automatic "true info", V(t) signal, digitization
➡ FADC ready
➡ New banks, bank I/O, automatic ROOT
➡ Geometry based on the Geometry Service
➡ GEMC App: simplified installation

Thomas Jefferson National Accelerator Facility

Page 13

Introducing factories of factories

Geometry factory types: MYSQL, TEXT, GDML, CLARA (service library plugin)

<detector name="DC12" factory="MYSQL" variation="mestayer" run_number="23"/>
<detector name="LTCC" factory="TEXT" variation="rotated5deg" run_number="100"/>
<detector name="EC" factory="CLARA" variation="original" run_number="1"/>
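To illustrate the factory-of-factories pattern behind these gcard entries, here is a minimal Java sketch (GEMC itself is C++, and every class name here is hypothetical): an outer registry maps the factory attribute to a concrete geometry factory, which then builds the named detector for the requested variation and run number.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Supplier;

    // Illustrative "factory of factories" sketch; not GEMC's actual implementation.
    public class GeometryFactorySketch {

        interface DetectorFactory {
            String build(String detector, String variation, int runNumber);
        }

        // Each concrete factory knows one geometry source (MySQL DB, text files, ...).
        static class MySqlFactory implements DetectorFactory {
            public String build(String d, String v, int run) {
                return d + " geometry from MYSQL, variation=" + v + ", run=" + run;
            }
        }

        static class TextFactory implements DetectorFactory {
            public String build(String d, String v, int run) {
                return d + " geometry from TEXT files, variation=" + v + ", run=" + run;
            }
        }

        // The outer factory: maps the gcard "factory" attribute to a factory instance.
        // (Only two of the four factory types are registered in this sketch.)
        static final Map<String, Supplier<DetectorFactory>> REGISTRY = new HashMap<>();
        static {
            REGISTRY.put("MYSQL", MySqlFactory::new);
            REGISTRY.put("TEXT", TextFactory::new);
        }

        static String buildDetector(String name, String factory, String variation, int run) {
            Supplier<DetectorFactory> s = REGISTRY.get(factory);
            if (s == null) throw new IllegalArgumentException("unknown factory: " + factory);
            return s.get().build(name, variation, run);
        }

        public static void main(String[] args) {
            // Mirrors the gcard entries shown above.
            System.out.println(buildDetector("DC12", "MYSQL", "mestayer", 23));
            System.out.println(buildDetector("LTCC", "TEXT", "rotated5deg", 100));
        }
    }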

Thomas Jefferson National Accelerator Facility

Page 14

GEMC Voltage Signal

• Each step produces a voltage signal based on DB parameters
• All signals are summed into a final V(t) shape
• Negligible effect on performance

Example: a 2.8 GeV/c proton producing a digital signal (FTOF panels 1a and 1b). An illustrative numerical sketch of the pulse-summing idea follows below.
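A minimal numerical sketch of the pulse-summing idea, assuming a Gaussian single-step pulse shape purely for illustration (GEMC takes the actual shape and its parameters from the DB; the class, units, and numbers below are made up):

    // Illustrative sketch: each simulation step contributes a small pulse,
    // and all step pulses are summed into a final V(t) shape.
    // The Gaussian shape and all parameters here are assumptions, not GEMC code.
    public class VoltageSumSketch {

        // One step: time of the energy deposit and the pulse amplitude it induces.
        static class Step {
            final double timeNs, amplitudeMv;
            Step(double timeNs, double amplitudeMv) {
                this.timeNs = timeNs;
                this.amplitudeMv = amplitudeMv;
            }
        }

        // Assumed single-step pulse: a Gaussian centered on the step time.
        static double stepPulse(Step s, double tNs, double sigmaNs) {
            double dt = (tNs - s.timeNs) / sigmaNs;
            return s.amplitudeMv * Math.exp(-0.5 * dt * dt);
        }

        public static void main(String[] args) {
            Step[] steps = {
                new Step(10.0, 5.0),
                new Step(11.5, 3.0),
                new Step(12.0, 7.0)
            };
            double sigmaNs = 1.0;   // assumed pulse width

            // Sample the summed V(t) on a coarse time grid, as an FADC would.
            for (double t = 5.0; t <= 20.0; t += 1.0) {
                double v = 0.0;
                for (Step s : steps) {
                    v += stepPulse(s, t, sigmaNs);   // sum of all step pulses
                }
                System.out.printf("t = %5.1f ns  V = %6.2f mV%n", t, v);
            }
        }
    }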

Thomas Jefferson National Accelerator Facility

Page 15

Advances in Analysis Software 1a

a) Generation-3 tracking (TRAC)
   i. Written as a Java service within ClaRA
   ii. New design, new algorithms, and improved efficiency
   iii. Ongoing code-validation processes
   iv. Used to analyze cosmic and test-stand data and to validate detector design changes
b) New systems included in the reconstruction chain
   i. CTOF and FTOF reconstruction written as Java services within ClaRA
   ii. FTCal reconstruction written as a Java service (ongoing standalone development)
c) Geometry Service
   i. TOF and DC get geometry constants from the Geometry Service (other detector systems to be included soon)
   ii. Simulation gets geometry constants from the Geometry Service
d) Event Builder
   i. Links the outputs of the services connected together in the reconstruction application
   ii. Output bank structure designed for output to ROOT

Thomas Jefferson National Accelerator Facility

Page 16

Advances in Monitoring & Calibration 1a

1. Advances in Monitoring
   a) Event displays
      i. Display the output of reconstruction as events are being processed (control-room displays for detector monitoring)
      ii. Perform statistical histogramming of the reconstruction output
   b) Event histogramming services
      i. Occupancy plots
      ii. Detector noise analysis, etc.
   c) Service and application monitoring
      i. Error handling
      ii. Incident monitoring
2. Advances in Calibration
   a) Calibration development
      i. TOF and EC systems (using ROOT input)
      ii. DC will use legacy algorithms and ROOT input

Thomas Jefferson National Accelerator Facility

Page 17

Previous Timeline/Milestones

Thomas Jefferson National Accelerator Facility

Page 18

New Timeline

• goes here…

Thomas Jefferson National Accelerator Facility

Page 19

Software Profiling and Verification

Hot Spot Analysis

Thomas Jefferson National Accelerator Facility

Page 20

Software Profiling and Verification

[Diagrams: tracking studies (TRAC Swimmer vs. GEMC tracking) and magnetic-field studies (Bx, By, Bz).]

Thomas Jefferson National Accelerator Facility

Page 21

Documentation/Workshops

• 3 software workshops
• September 2012
  – What was covered:
    • The CLARA service-oriented framework
    • Service design and implementation
    • The EVIO transient data/event format
• October 2012
  – A walk-through of the steps needed to:
    • run gemc on the CUE machines to get a digitized version of the generated events
    • set up and run the reconstruction code on a personal computer/laptop (Java required) or on the CUE machines
    • visualize and perform some simple analysis on the output data

Thomas Jefferson National Accelerator Facility

Page 22

Documentation/Workshops

• February 2013

Thomas Jefferson National Accelerator Facility

Page 23

CLAS12 Data Distribution/Workflow Tool

• Tagged File System:
  • A tagged file system is needed to sort run files according to run conditions, targets, beam type, and energy.
  • Ability to add metadata and run properties (such as beam current and total charge) to the run collections.
• CLARA Distributed Environment:
  • Grid-alternative computing environment for accessing the data from anywhere and for pre-processing data to decrease network traffic.
  • Multi-node Dynamic Process Environments (DPEs) for local-network distributed computing and data synchronization.
• Data Mining Analysis Framework:
  • Distributed experimental-data framework, with search and discovery using TagFS.
  • Analysis codes for specific experimental data sets, including particle identification, momentum corrections, and fiducial cuts.

Thomas Jefferson National Accelerator Facility

Page 24

Management

• Software workshops before scheduled CLAS collaboration meetings
• Weekly software meetings (video-conferencing)
• Mantis (bug reporting)
• The ClaRA framework supports multiple languages, to accommodate and encourage user contributions
• The Calibration and Commissioning committee is a collaboration-wide body that assumes responsibility for overseeing CLAS12 software and computing activities
• Software upgrades/modifications as well as bug fixes are discussed using Mantis and the e-mail list
• Internal JLab reviews (e.g., tracking-algorithm discussions with the Hall D group)
• Milestone changes to address critical issues, e.g.:
  • Data transfer through shared memory
  • Minimizing EvIO serialization/deserialization

2b-f

Thomas Jefferson National Accelerator Facility

Page 25

Addressing Previous Recommendations

• Stress tests
  • Linear scaling with the number of cores
  • 50-node test in progress
• Usability (see break-out sessions)
  • The Data Mining project uses ClaRA to ship data and run analyses at universities all over the world
  • Simulation is well advanced and used in proposals
  • Generation-3 tracking has been rebuilt and is starting to be used by detector groups
  • EvIO-to-ROOT converter C++ service in development

2a

Thomas Jefferson National Accelerator Facility

Page 26

Addressing Previous Recommendations

Recommendation: A series of scaling tests ramping up using the LQCD farm should be planned and undertaken.

Response: A series of tests was run on the current batch farm (up to 32 hyper-threaded cores) to confirm ClaRA scaling and system robustness. Currently ramping to … cores. Full stress test planned for ….

Recommendation: Seriously consider using ROOT as the file format in order to make use of the steady advances in its I/O capabilities.

Response: Considered. A ROOT data converter is being developed, particularly for calibration services. That is, persistent data remains EVIO, but ROOT is an available file format.

Recommendation: The costs and sustainability of supporting two languages, relative to the advantages, should be regularly assessed as the community of users grows, code development practices become clearer, the framework matures further, etc.

Response: Service language was chosen based on requirements. In fact a third language, Python, was added, specifically for the PWA analysis service (the SciPy fitter is faster). Multilingual support has increased the availability of programmers, e.g. for ROOT-based calibration services. The Geometry Service needed to be written in C++ for GEMC compatibility.

2a

Thomas Jefferson National Accelerator Facility

Page 27

Risks and Mitigation

• Communication latency
  • Has been resolved by introducing inter-node deployment of services with shared memory and in-memory data caching
• Broad author and user pools
  • Proper management and administration; strict service canonization rules
• Workloads of different clients may introduce "pileups" on a single service
  • Service and cloud governance (e.g., service locking)
• Network security
  • Client authentication and message encryption
• Limited manpower
  • Interfaces (C++/Java) provide access to CLAS legacy code
  • ROOT data interface broadens the programmer base for calibration code

2d

Thomas Jefferson National Accelerator Facility

Page 28

Summary

• 1a) Is Hall B making appropriate progress in developing simulation, calibration and analysis software?

• Yes. Simulation is in an advanced state since it was needed to validate detector design and performance. All detector subsystems are modeled, background simulation is realistic, geometry is aligned with reconstruction through a common service interface, and, finally, the package is easy to use.

• Calibration is at the advanced design stage, appropriate since it is the last element needed in the overall software chain. Hall B has an advantageous situation, in that the detector subsystems are well-understood by the subsystem managers, being very similar or in some cases, identical, to systems used in the previous CLAS detector.

• Analysis software has been designed from the bottom up, and the event reconstruction software written and tested for major subsystems: time of flight, calorimetry and charged particle tracking. Close cooperation among the core group of developers has produced a well-designed framework with a similar "look and feel" between the different systems, which should ease the tasks of debugging, maintenance and improvements over the years. Higher level analysis (event selection and binning, fiducial cuts, kinematic fitting, invariant mass reconstruction, etc.) has only just begun, but the core group are providing convenient tools for collaborative effort as demonstrated by some of the outside groups.

Thomas Jefferson National Accelerator Facility

Page 29

Summary (cont.)

• Meeting previous milestones?
  • Yes. In a few cases we have re-prioritized effort (for example, placing more emphasis on basics such as the event format, object-model definition, and production of the core geometry and database services, while delaying the detailed instantiation of the individual detector calibrations, which will be the last step in fine-tuned event reconstruction).
• Are the milestones adequate and clearly defined?
  • Yes
• Is Hall B effectively utilizing collaboration manpower?
  • The majority of work in the past year (framework development and writing of core services) has been done largely by the core group. However, some of that core group are located remotely, demonstrating that this is not a hindrance to close collaboration. In addition, the software team has made a significant effort to engage the collaboration by holding a number of "hands-on" workshops, and by encouraging subsystem software groups to build their calibration GUIs on a ROOT-based framework. This should provide a sizeable group of people to work on the details of calibration over the next two years.
• Collaboration
  • CCDB shared between Halls B and D; EVIO developed by the DAQ group and used for both persistent and transient data; farm workflow developed by Scientific Computing in collaboration with Hall B; GEMC (B&D); Event Display (B&D); RootSpy (B&D); tracking and reconstruction algorithms (B&D)

Thomas Jefferson National Accelerator Facility

Page 30

BACK-UPS

Thomas Jefferson National Accelerator Facility

Page 31

Summary

ClaRA has advanced considerably in the past year through actual deployment

GEMC integrated with the geometry database

Several interactive workshops held on both ClaRA service development and ClaRA deployment to introduce the environment to the collaboration

ClaRA deployments and reconstruction chains implemented on a variety of collaboration farms.

Steady development of requisite services, in particular generation III tracking with Kalman Filter

Initial work on some calibration and monitoring services

Initial work on Histogramming/Statistical services

Initial work on service profiling and verification

Thomas Jefferson National Accelerator Facility

Page 32

User SOA Application Designer

Thomas Jefferson National Accelerator Facility

Page 33

Thomas Jefferson National Accelerator Facility

Page 34

ClaRA Components

[Diagram: ClaRA components: Cloud Controller (platform), DPEs hosting containers (C) and services (S), and an Orchestrator.]

Platform (Cloud Control Node)
• Service Bus (pub-sub server)
• Registration (service registration)
• Discovery (service discovery)
• Administration (keeps an inventory of all running DPEs and deployed services; service deployment, removal, and recovery)
• Governing (provides information on service availability and distribution)
• Monitoring

DPE (Computing Node)
• Each node acts as a DPE.
• All services are deployed and executed by threads inside the DPE process.
• Global memory to share data between services.

Orchestrator
• Designs and controls ClaRA applications.
• Coordinates service execution and data flow.
• Usually runs outside of the DPE.
• Deploys services to DPEs.
• Links services together: the output of a service is sent as input to its linked service.

Thomas Jefferson National Accelerator Facility

Page 35

Service Container
• Groups and manages services in a DPE.
• Can be used as a namespace to separate services.
  o The same service engine can be deployed in different containers in the same DPE.
• Handles service execution and its output.
• The service container presents a user engine as an SOA service (SaaS implementation).

[Diagram: the container wraps the engine interface and message processing around the Service Engine.]

Service Engine
• The fundamental unit of a ClaRA-based application.
• Receives input data in an envelope and generates output data.
  o The data envelope is the same for all services.
• Implements the ClaRA standard interface (see the sketch below):
  o a configure method,
  o an execute method,
  o several description/identification methods.
• Must be thread-safe.
  o The same service engine can be executed in parallel multiple times.
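A minimal Java sketch of a service engine along these lines; the interface name, method signatures, and the example engine are assumptions for illustration, not the actual ClaRA API.

    import java.nio.ByteBuffer;

    // Illustrative sketch of a ClaRA-style service engine: a configure method,
    // an execute method, and identification methods, written to be thread-safe.
    // Interface and method names are assumptions, not the real ClaRA classes.
    public class EngineSketch {

        interface Engine {
            void configure(String config);
            ByteBuffer execute(ByteBuffer input);   // same envelope type in and out
            String name();
            String description();
        }

        // A nearly stateless engine: the only state is set once in configure(),
        // so the same instance can safely be executed in parallel many times.
        static class WordCountEngine implements Engine {
            private volatile int headerWords = 2;   // set once in configure()

            public void configure(String config) {
                headerWords = Integer.parseInt(config.trim());
            }

            public ByteBuffer execute(ByteBuffer input) {
                ByteBuffer in = input.asReadOnlyBuffer();   // never mutate shared input
                int payloadWords = in.remaining() / 4 - headerWords;
                ByteBuffer out = ByteBuffer.allocate(4);
                out.putInt(payloadWords).flip();
                return out;
            }

            public String name() { return "WordCount"; }
            public String description() { return "Counts payload words in an event buffer"; }
        }

        public static void main(String[] args) {
            Engine e = new WordCountEngine();
            e.configure("2");
            ByteBuffer event = ByteBuffer.allocate(20);   // 2 header + 3 payload words
            System.out.println(e.name() + ": " + e.execute(event).getInt() + " payload words");
        }
    }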

Thomas Jefferson National Accelerator Facility

Page 36

Service Communication

[Diagram: service communication. On each computing node (1...N), a Java DPE and a C++ DPE host Services 1...N and exchange data through transient data storage; a service bus on each node connects the nodes.]

Thomas Jefferson National Accelerator Facility

Page 37

Transient Data Envelope

Thomas Jefferson National Accelerator Facility

Page 38

Single Data-stream Application

[Diagram: single data-stream application. Reader/writer (R/W) and Administrative Services on the ClaRA Master DPE (Executive Node), persistent storage, an application orchestrator (AO), and services S1...Sn on Farm Nodes 1...N.]

Thomas Jefferson National Accelerator Facility

Page 39

Multiple Data-stream Application

[Diagram: multiple data-stream application, as shown on Page 7: reader/writer (R/W), Administrative Services on the ClaRA Master DPE (Executive Node), application orchestrator (AO), data streams (DS), services S1...Sn on Farm Nodes 1...N, and persistent storage.]

Thomas Jefferson National Accelerator Facility

Page 40

Application Graphical Designer

Thomas Jefferson National Accelerator Facility

Page 41

Computing Model

[Diagram: CLAS12 computing model.
• Online Farm: CLAS12 detector electronics, trigger, and slow controls feed ET/online transient data storage and permanent data storage. The Online Application Orchestrator runs online EB services, online monitoring services, event visualization services, and online calibration services, using the calibration database, conditions database, geometry/calibration services, and run-conditions services, with cloud control, service registration, and service control.
• Offline JLab Farm: the Physics Data Processing Application Orchestrator runs GEANT4/GEMC simulation, EB services, DST/histogram/visualization services, analysis services, calibration services, run-conditions services, and geometry/calibration services against permanent data storage.
• Offline University Clouds 1...n: permanent data storage plus calibration and conditions databases, coordinated through a cloud scheduler.]

Thomas Jefferson National Accelerator Facility

Page 42

Single Event Reconstruction

• Read EVIO events from the input file.
• Events pass from service to service in the chain.
  o Services add more banks to the event.
• Write events to the output file.

R → S1 → S2 → ... → SN → W

(An illustrative sketch of this chain follows below.)
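A minimal Java sketch of this reader-to-writer chain, with hypothetical bank names standing in for the real reconstruction services; each service simply appends a bank to the event as it passes through. The types here are illustrative, not the actual CLAS12 bank structures.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of the R -> S1 -> S2 -> ... -> SN -> W chain:
    // each service adds a bank to the event as it moves down the chain.
    public class ReconChainSketch {

        // An "event" as an ordered map of named banks.
        static class Event {
            final Map<String, double[]> banks = new LinkedHashMap<>();
        }

        interface Service {
            Event process(Event event);   // returns the event with extra banks added
        }

        // Helper that builds a service which appends one named bank.
        static Service addBank(String name, double... values) {
            return event -> { event.banks.put(name, values); return event; };
        }

        public static void main(String[] args) {
            // The chain stands in for tracking, TOF, EC, and event-builder services.
            List<Service> chain = List.of(
                addBank("TRACKS", 1.2, 0.8),
                addBank("TOF", 24.5),
                addBank("EC", 0.35)
            );

            // Reader: two dummy events standing in for file records.
            List<Event> input = List.of(new Event(), new Event());

            // Pass each event service to service, then "write" it out.
            for (Event ev : input) {
                Event processed = ev;
                for (Service s : chain) {
                    processed = s.process(processed);
                }
                System.out.println("wrote event with banks: " + processed.banks.keySet());
            }
        }
    }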

Thomas Jefferson National Accelerator Facility

Page 43

Multi-Core Reconstruction

[Diagram: multi-core reconstruction. Within one DPE, a single reader (R) and writer (W) feed several parallel service chains S1 → S2 → ... → SN, coordinated by an orchestrator (O).]

Thomas Jefferson National Accelerator Facility

Page 44

Multi-Core Reconstruction

Thomas Jefferson National Accelerator Facility

Page 45

Multi-Core Reconstruction

Thomas Jefferson National Accelerator Facility

Page 46

Multi-Core Reconstruction

Thomas Jefferson National Accelerator Facility

Page 47

Multi-Node Reconstruction

[Diagram: multi-node reconstruction. A reader (R) and writer (W) on an I/O DPE (DPEio) exchange events with DPE1...DPEn, each running parallel service chains S1 → S2 → ... → SN; orchestrators (DO, MO) coordinate the application.]

Thomas Jefferson National Accelerator Facility

Page 48

Multi-Node Reconstruction

Thomas Jefferson National Accelerator Facility

Page 49

Batch Deployment

Thomas Jefferson National Accelerator Facility

Page 50

Single Data-stream Application
CLAS12 Reconstruction: JLab batch farm

[Plot: Data Processing Rate, single data stream, as on Page 6. Rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading), for Ethernet, Infiniband, and RAM Disk I/O; linear fit to RAM Disk I/O: f(x) = 0.193x + 0.114, R² = 0.990.]

Thomas Jefferson National Accelerator Facility

Page 51

Multiple Data-stream Application
CLAS12 Reconstruction: JLab batch farm

[Plot: Data Processing Rate, multiple data streams, as on Page 8. Rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading); one data file of 10K events per processing node; linear fit: f(x) = 0.2x, R² = 1.]

Thomas Jefferson National Accelerator Facility

Page 52

Previous Timeline/Milestones 1b

Thomas Jefferson National Accelerator Facility

Page 53

Single Data-stream Application

[Plot: Data Processing Rate, single data stream, as on Page 6. Rate (kHz) vs. number of processing nodes (32 cores: 16 cores with hyperthreading), for Ethernet, Infiniband, and RAM Disk I/O; linear fit to RAM Disk I/O: f(x) = 0.193x + 0.114, R² = 0.990.]