First operational experience with the CMS Run Control System


Page 1: First operational experience with the  CMS Run Control System

First operational experience with the CMS Run Control System

Hannes Sakulin, CERN/PH, on behalf of the CMS DAQ group

17th IEEE Real Time Conference, 24-28 May 2010, Lisbon, Portugal

Page 2: First operational experience with the  CMS Run Control System

The Compact Muon Solenoid Experiment

[Figure: cut-away view of the CMS detector, labeling the Drift-Tube Chambers, Cathode Strip Chambers, Resistive Plate Chambers, Iron Yoke, 4 T Superconducting Coil, Trackers (Silicon Strip and Silicon Pixel), Electromagnetic Calorimeter and Hadronic Calorimeter.]

• LHC: p-p collisions, E_CM = 14 TeV (2010: 7 TeV); heavy ions; bunch crossing frequency 40 MHz
• CMS: multi-purpose detector, broad physics programme; 55 million readout channels

Page 3: First operational experience with the  CMS Run Control System

CMS Trigger and DAQ design

• First Level Trigger (hardware): up to 100 kHz
• Central DAQ: builds events at 100 kHz, 100 GB/s (i.e. an average event size of about 1 MB); 2 stages; 8 independent event builder / filter slices
• High Level Trigger running on the filter farm: ~700 PCs, ~6000 cores
• In total, around 10000 applications to control

[Figure: DAQ architecture, from the Frontend Readout Links through the event builder stages to the filter farm.]

Page 4: First operational experience with the  CMS Run Control System

CMS Control Systems

[Diagram: the Run Control System (Java, web technologies) sits on top, controlling the Trigger and DAQ branches (event builder slices) as well as the sub-detector branches (Tracker, ECAL, ...). The Trigger Supervisor (XDAQ, C++) controls the Front-end Drivers and the First Level Trigger. Data flows from the front-end electronics into the central DAQ and High Level Trigger farm. The DCS runs alongside.]

Page 5: First operational experience with the  CMS Run Control System

CMS Control Systems

[Diagram: as on the previous slide, now showing the Detector Control System (DCS) in detail: built on PVSS (Siemens ETM) and SMI (State Management Interface), it controls low voltage, high voltage, gas and the magnet for the sub-detectors (Tracker, ECAL, ...).]

Page 6: First operational experience with the  CMS Run Control System

CMS Run Control System

• Run Control World – Java, web technologies: defines the control structure
  • Function Manager: a node in the Run Control tree; defines a state machine and parameters. User function managers are dynamically loaded into the web application
  • Run Control Web Application: Apache Tomcat servlet container; Java Server Pages, tag libraries, web services (WSDL, Axis, SOAP)
  • GUI in a web browser: HTML, CSS, JavaScript, AJAX
• XDAQ World – C++, XML, SOAP: XDAQ applications control hardware and data flow
  • XDAQ is the framework of the CMS online software; it provides hardware access, transport protocols, services, etc.
• ~10000 applications to control
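To make the function manager concept concrete, here is a minimal sketch of what a user function manager amounts to: a Java class, loaded dynamically into the web application, that holds a parameter set and reacts to input events. All class, method and parameter names below are invented for illustration; this is not the actual Run Control framework API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a function manager is a node in the run control
// tree that defines a state machine and a parameter set, and is loaded
// dynamically into the Tomcat web application.
public class ExampleFunctionManager {

    private final Map<String, String> parameterSet = new ConcurrentHashMap<>();

    // Hypothetical lifecycle hook called when the FM is loaded.
    public void init() {
        parameterSet.put("RUN_KEY", "cosmics");   // invented example parameter
    }

    // Hypothetical handler for the Configure input event: merge the
    // command's parameters, then configure the child resources (XDAQ
    // applications or child FMs) and report the new state to the parent.
    public void onConfigure(Map<String, String> commandParameters) {
        parameterSet.putAll(commandParameters);
        // ... forward Configure to child resource proxies ...
    }
}
```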

Page 7: First operational experience with the  CMS Run Control System

[Diagram: anatomy of a function manager. The framework provides the Event Processor, the State Machine Engine (driven by a State Machine Definition), the Parameter Set, and a web service interface from/to the parent function manager / GUI, including asynchronous state notifications; custom user code supplies the Event Handlers. Child resources are reached through proxies:
• Child Resource Proxy – Run Control: web service to/from child function managers for lifecycle + configuration, command, parameter and monitor requests, with asynchronous notifications coming back;
• Child Resource Proxy – XDAQ: sends lifecycle, command and parameter messages (the lifecycle via the Job Control servlet / web service, which starts the processes) and receives state, errors and parameters via a state machine callback;
• Child Resource Proxy – PSX: servlet to/from the Detector Control System.
The diagram distinguishes custom code from framework code.]
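The moving parts in this diagram map naturally onto a small event-driven loop: commands from the parent and notifications from children are queued as input events, and the engine dispatches them to handlers. The sketch below is a generic illustration of that pattern under invented names, not the framework's actual implementation.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the event-processor pattern in the diagram above;
// class names are illustrative, not the framework's actual classes.
final class StateMachineEngine implements Runnable {

    interface EventHandler { void handle(String event); }

    private final BlockingQueue<String> inputEvents = new LinkedBlockingQueue<>();
    private final List<EventHandler> handlers = new CopyOnWriteArrayList<>();

    // Events arrive from the web service (parent commands) and from
    // child resource proxies (asynchronous state notifications).
    void post(String event) { inputEvents.add(event); }

    void addHandler(EventHandler h) { handlers.add(h); }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String event = inputEvents.take();   // event-processor loop
                for (EventHandler h : handlers) {
                    h.handle(event);                 // user event handlers
                }
                // ... the engine would now look up the transition in the
                // state machine definition and asynchronously notify the
                // parent function manager of the new state ...
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```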

Page 8: First operational experience with the  CMS Run Control System

[Diagram: the same function manager anatomy as on the previous slide, now shown with the surrounding services. The Resource Service DB and the DAQ Structure DB provide the configuration (FM + XDAQ); the RunInfo DB stores conditions; a Log Collector gathers logs from the function managers and XDAQ applications; and the XDAQ Monitoring & Alarming System collects monitoring data and errors.]

Page 9: First operational experience with the  CMS Run Control System

Entire DAQ System Structure is Configurable

The Resource Service database holds, with versioning:
• Control structure: Function Managers to load (URL), parameters, child nodes
• Configuration of XDAQ Executives (XML): libraries to be loaded, applications (e.g. builder unit, filter unit) and their parameters, network connections, collaborating applications

High-level tools generate the configurations. At run time, the configuration data flows from the database through the Resource Service API into Run Control, which distributes the XML configurations to the XDAQ executives via SOAP through the Job Control service.
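Read as a data model, this versioned configuration amounts to a tree of control nodes, each pointing at the function manager to load and the XDAQ executive configurations it drives. The records below sketch that shape with invented field names; the actual Resource Service schema is not shown in the slides.

```java
import java.util.List;
import java.util.Map;

// Sketch of the data model implied by this slide; field names are
// invented, not the actual Resource Service schema.
record XdaqExecutiveConfig(
        String hostUrl,                 // where Job Control starts the executive
        List<String> libraries,         // libraries to be loaded
        List<String> applications,      // e.g. builder unit, filter unit
        Map<String, String> parameters,
        List<String> networkConnections) { }

record ControlNode(
        String functionManagerUrl,      // Function Manager to load (URL)
        Map<String, String> parameters,
        List<ControlNode> childNodes,   // the control structure is a tree
        List<XdaqExecutiveConfig> executives) { }
```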

Page 10: First operational experience with the  CMS Run Control System

CMS Control Tree

[Diagram: the control tree. The Level-0 node, with its GUI in a web browser, sits on top. Below it are Level-1 nodes for the Trigger, the DAQ and the sub-systems (Tracker, ECAL, DT, RPC, ...). Under the DAQ node, Level-2 nodes control the TTS, the FED Builders and Slices 0-7, each with FB, RB and HLT children; the sub-system branches control FECs and FEDs.]

• Level-0: control and parameterization of the run
• Level-1: common state machine and parameters
• Level-2 ... Level-n: sub-system specific

Abbreviations: FEC = Frontend Controller, FED = Frontend Driver, TTS = Trigger Throttling System, FB = FED Builder, RB = Readout Builder, HLT = High Level Trigger.

The framework and the top-level Run Control are developed by the central team; sub-system Run Control is developed by the sub-system teams.

Page 11: First operational experience with the  CMS Run Control System

RCMS Level-1 State Machine (simplified)

States: Created, Halted, Pre-Configured, Configured, Running, Paused, Error.

Transitions:
• Creation: load and start the Level-1 Function Managers (→ Created)
• Initialization: start the further levels of function managers; start all XDAQ processes on the cluster (→ Halted)
• New: Pre-Configuration (trigger only, a few seconds): sets up the clock and periodic timing signals (→ Pre-Configured)
• Configuration: load the configuration from the database; configure hardware and applications (→ Configured)
• Start run (→ Running)
• Pause / Resume: pauses / resumes the trigger (and the trackers, which may need to change settings)
• Stop run (→ Configured)
• Halt (→ Halted)
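The allowed transitions read directly off this list. A minimal sketch in Java, assuming for illustration that Configure is also allowed straight from Halted, and omitting the Error state and failure transitions; this is not the framework's actual code:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Set;

// Sketch of the simplified Level-1 state model above.
enum RunState { CREATED, HALTED, PRE_CONFIGURED, CONFIGURED, RUNNING, PAUSED }

final class Level1StateMachine {

    private static final Map<RunState, Set<RunState>> ALLOWED = new EnumMap<>(RunState.class);
    static {
        ALLOWED.put(RunState.CREATED, Set.of(RunState.HALTED));            // initialize
        ALLOWED.put(RunState.HALTED, Set.of(RunState.PRE_CONFIGURED,       // pre-configure
                                            RunState.CONFIGURED));         // configure
        ALLOWED.put(RunState.PRE_CONFIGURED, Set.of(RunState.CONFIGURED)); // configure
        ALLOWED.put(RunState.CONFIGURED, Set.of(RunState.RUNNING,          // start run
                                                RunState.HALTED));         // halt
        ALLOWED.put(RunState.RUNNING, Set.of(RunState.PAUSED,              // pause
                                             RunState.CONFIGURED,          // stop run
                                             RunState.HALTED));            // halt
        ALLOWED.put(RunState.PAUSED, Set.of(RunState.RUNNING,              // resume
                                            RunState.CONFIGURED,           // stop run
                                            RunState.HALTED));             // halt
    }

    private RunState state = RunState.CREATED;

    synchronized void transitionTo(RunState next) {
        if (!ALLOWED.getOrDefault(state, Set.of()).contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        state = next;
    }

    synchronized RunState state() { return state; }
}
```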

Page 12: First operational experience with the  CMS Run Control System

Top-Level Run Control (Level-0)

• Central point of control; global state machine
• The Level-0 allows the configuration to be parameterized:
  • Sub-system Run Key (e.g. level of zero suppression)
  • First Level Trigger Key / High Level Trigger Key
  • Clock source (LHC / local)

Page 13: First operational experience with the  CMS Run Control System

Masking of components

The Level-0 allows components to be masked out:
• Remove/add sub-systems from control and readout
• Remove/add detector partitions
• Remove/add individual Frontend Drivers (masking their connection to the readout (SLINK) and to the Trigger Throttling System)
• Mask out DAQ slices (= 1/8 of the central DAQ)

Page 14: First operational experience with the  CMS Run Control System


Commissioning and First Operation with the LHC


Page 16: First operational experience with the  CMS Run Control System

Mini DAQ (“partitioning”)

• Dedicated small DAQ setups for most sub-systems
• Low bandwidth, but sufficient for most tests
• A Mini DAQ may be used in parallel to the Global Runs (heavily used in the commissioning phase)

[Diagram: a Global Run, in which the Level-0 controls the Global Trigger, the Global DAQ (Slices 0-7) and the sub-systems (Tracker, ...), next to a MiniDAQ Run, in which a separate Level-0 controls e.g. ECAL or DT with a Local Trigger Controller (or the Global Trigger) and a MiniDAQ.]

Page 17: First operational experience with the  CMS Run Control System

Commissioning and First Operation

• Independent parallel commissioning of sub-detectors
  • Mini DAQ setups allow for standalone operation
• Run start time
  • End of 2008: globally 8.5 minutes; central DAQ: 5 minutes (cold start)


Page 19: First operational experience with the  CMS Run Control System

Optimization of run startup time

• Globally
  • Optimized the global state model (pre-configuration)
  • Provided tools for parallelization of user code (parameter handling); see the sketch after this list
  • Sub-system specific performance improvements
• Central DAQ
  • Developed a tool to analyze log files and plot timelines of all operations
  • Distributed central DAQ control over 5 Apache Tomcat servers (previously 1)
  • Reduced message traffic between Run Control and XDAQ applications by combining commands and parameters into a single message
  • New startup method for High Level Trigger processes on multi-core machines: initialize and configure a mother process, then fork the child processes; reduced memory footprint due to copy-on-write
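Two of these optimizations, parallelizing user code across child resources and merging a command with its parameters into one message, can be pictured with a short sketch. The interface and helper below are invented for illustration, not the framework's actual tools.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration of parallelized command dispatch to child resources.
final class ParallelDispatch {

    interface ChildProxy {
        // One message carries both the command and its parameters,
        // instead of a separate parameter-set call followed by a command.
        void send(String command, Map<String, String> parameters);
    }

    static void sendToAll(List<ChildProxy> children,
                          String command,
                          Map<String, String> parameters) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Math.min(children.size(), 16)));
        try {
            CompletableFuture.allOf(
                    children.stream()
                            .map(c -> CompletableFuture.runAsync(
                                    () -> c.send(command, parameters), pool))
                            .toArray(CompletableFuture[]::new)
            ).join();   // wait until every child has been commanded
        } finally {
            pool.shutdown();
        }
    }
}
```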


Page 21: First operational experience with the  CMS Run Control System

Run Start timing (May 2010)

• Globally 4 ¼ minutes; central DAQ: 1 ¼ minutes (Initialize, Configure, Start)
• Configuration time is now dominated by front-end configuration (Tracker)
• Pause/Resume is 7x faster than Stop/Start

[Chart: run start operations per sub-system vs. time (seconds).]



Page 24: First operational experience with the  CMS Run Control System

Commissioning and First Operation

• Independent parallel commissioning of sub-detectors
  • Mini DAQ setups allow for standalone operation
• Run start time
  • End of 2008: globally 8.5 minutes; central DAQ: 5 minutes (cold start)
  • Now: globally < 4 ¼ minutes; central DAQ: 1 ¼ minutes
• Initially some stability issues
  • Problems solved by debugging user code (thread leaks)
• Recovery from sub-system faults
  • Control of individual sub-systems from the top-level control node
  • Fast masking / unmasking of components (partial re-configuration only)
• Operator efficiency
  • Operation is complex: sub-system inter-dependencies when configuring partially; dependencies on internal and external parameters; procedures to follow (clock change)
  • Operators are no longer DAQ experts but colleagues from the entire collaboration
  • Built-in cross-checks to guide the operator

Page 25: First operational experience with the  CMS Run Control System

Built-in cross-checks

Built-in cross-checks guide the shifter:
• Indicate the sub-systems to re-configure if
  • a parameter is changed in the GUI,
  • a sub-system / FED is added or removed, or
  • external parameters change
• Enforce the correct order of re-configuration
• Enforce re-configuration of CMS if the clock source changed or the LHC has been unstable

Improved operator efficiency.
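The bookkeeping behind such checks can be as simple as a set of "stale" sub-systems maintained by the top-level node. The toy sketch below illustrates the idea with invented names; it is not the actual Level-0 implementation.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Toy illustration of the cross-check idea: remember which sub-systems
// are stale and refuse to start until they have been re-configured.
final class ReconfigureCrossCheck {

    private final Set<String> stale = new HashSet<>();

    // A parameter changed in the GUI, a sub-system / FED was added or
    // removed, or an external parameter changed.
    void markStale(Collection<String> affectedSubsystems) {
        stale.addAll(affectedSubsystems);
    }

    void onConfigured(String subsystem) {
        stale.remove(subsystem);
    }

    // Called before Start: guides the shifter instead of letting an
    // inconsistent configuration run.
    void assertReadyToStart() {
        if (!stale.isEmpty()) {
            throw new IllegalStateException(
                    "Re-configure before starting: " + stale);
        }
    }
}
```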

Page 26: First operational experience with the  CMS Run Control System

Operation with the LHC

• Cosmic run:
  1. Bring the detector into the desired state (Detector Control System)
  2. Start data acquisition (Run Control System)
• LHC:
  • Detector state and DAQ state depend on the LHC
  • We want to keep the DAQ going before beams are stable, to ensure that we are ready

[Chart: LHC dipole current vs. time. The LHC clock is stable outside the ramp; during the ramp, clock variations may unlock some links in the trigger. The tracking detectors' high voltage is only ramped up when beams are stable (detector safety).]

Page 27: First operational experience with the  CMS Run Control System

Integration with DCS & automatic actions

• In order to keep the DAQ going, Run Control needs to be aware of the LHC and detector states
• The top-level control node is notified about changes and propagates them to the concerned systems (Trigger + Trackers):
  • The Trigger masks channels while the LHC is ramping
  • The Silicon-Strip Tracker masks its payload when running with HV off (noise)
  • The Silicon-Pixel Tracker reduces gains when running with HV off (high currents)
• The top-level control node triggers an automatic pause/resume when relevant DCS / LHC states change during a run

[Diagram: the Level-0 of the Run Control System connected to the Detector Control System (Tracker, ECAL, ..., and the LHC) via PSX, the PVSS SOAP eXchange XDAQ service.]
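Schematically, each relevant state change maps to a pause, a propagation of the change to the concerned systems, and a resume. A toy sketch with invented state names and methods, not the actual Level-0 code:

```java
// Toy sketch of the automatic actions described above: the top-level
// node reacts to LHC / DCS state changes during a run.
final class AutomaticActions {

    interface Level0 {
        void pauseRun();
        void resumeRun();
        void propagate(String actionToConcernedSystems);
    }

    private final Level0 level0;

    AutomaticActions(Level0 level0) { this.level0 = level0; }

    void onLhcRampStarted() {
        level0.pauseRun();
        level0.propagate("mask sensitive trigger channels");
        level0.resumeRun();
    }

    void onTrackerHighVoltageOn() {
        level0.pauseRun();
        level0.propagate("enable tracker payload, lower thresholds");
        level0.resumeRun();   // the HV state is also logged in the data
    }
}
```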

Page 28: First operational experience with the  CMS Run Control System

Automatic actions

[Timeline: LHC dipole current vs. time, annotated with the automatic actions taken at each phase. At the start of the ramp, sensitive trigger channels are masked; when the ramp is done, they are unmasked. While a CMS run proceeds with tracker HV off, the payload is disabled, thresholds are raised, and the HV state is logged in the data. When beams are stable, the tracker HV is ramped up; with tracker HV on, the payload is enabled, thresholds are lowered, and the HV state is logged in the data. When beams stop, the tracker HV is ramped down again. The LHC clock is stable outside the ramp.]

Page 29: First operational experience with the  CMS Run Control System

Observations

• Standardizing the experiment's software is important for long-term maintenance
  • Almost successful, considering the size of the collaboration
  • The Run Control Framework was available early in the development of the experiment's software (2003)
  • Adopted by all sub-systems, but some sub-systems built their own framework underneath
• Ease of use becomes more and more important
  • Run Control / DAQ is now operated by members of the entire CMS collaboration
• Running with high live-time: > 95% so far for stable-beam periods in 2010

Page 30: First operational experience with the  CMS Run Control System

Observations – Web Technology

• Operations
  • Typical advantages of a web application: multiple clients, remote login
  • Stability of the server (Apache Tomcat + Run Control web application) is very good: running for weeks
  • Stability of the GUI depends on third-party products (the browser); behavior changes from one release to the next. Not a big problem: the GUI can be restarted without affecting the run
• Development
  • Knowledge of Java and the Run Control Framework is sufficient for basic function managers; the web-based GUI and web technologies are handled by the framework
  • Development of complex GUIs, such as the top-level control node, is more difficult: many technologies need to be mastered; modern web toolkits are not yet used by Run Control

Page 31: First operational experience with the  CMS Run Control System

Summary & Outlook

• The CMS Run Control System is based on Java & web technologies
  • Good stability
  • Top-level control node optimized for efficiency: flexible operation of individual sub-systems; built-in cross-checks to guide the operator
  • Automatic actions triggered by detector and LHC state
• High CMS data-taking efficiency: live-time > 95%
• Next developments
  • Further improve fault tolerance
  • Automatic recovery procedures
  • Auto Pilot

[Figure: collision candidate event display.]