TRANSCRIPT

CHEP '03
March 24, 2003
Vincenzo Innocente, CERN/EP
http://cmsdoc.cern.ch/cmsoo/cmsoo.html

CMS Data Analysis: Present Status, Future Strategies
Analysis

Analysis is not just using a tool to plot a histogram, but the full chain from accessing event data up to producing the final plot for publication.
Analysis is an iterative process:
- Reduce data samples to more interesting subsets (selection)
- Compute higher-level information
- Calculate statistical entities
Several steps:
- Run an analysis job on the full dataset (a few times)
- Use an interactive analysis tool to run several times on the reduced dataset and make plots
We are still in the early stage of defining an Analysis Model:
- Today we work with raw data; reconstruction and analysis are mixed up (analysis and debugging are mixed up!)
- Software development, production and analysis proceed in parallel
- No clear concept of high-level persistent objects (DST)
- Each physics group has its own analysis package and "standard ntuple"
CMS is a laboratory for experimenting with analysis solutions.
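The iterative chain just described (select, compute higher-level quantities, calculate statistics) can be sketched as follows; the event fields and cut values here are purely illustrative, not from any CMS data model:

```python
# Minimal sketch of the analysis chain: selection -> higher-level info -> statistics.
# Event fields ("pt") and the cut value are hypothetical illustrations.

def select(events, pt_cut=20.0):
    """Reduce the data sample to a more interesting subset (selection step)."""
    return [e for e in events if e["pt"] > pt_cut]

def compute(events):
    """Compute higher-level information for each selected event."""
    return [dict(e, pt2=e["pt"] ** 2) for e in events]

def mean(values):
    """Calculate a statistical entity over the reduced sample."""
    return sum(values) / len(values)

events = [{"pt": 10.0}, {"pt": 25.0}, {"pt": 40.0}]
reduced = compute(select(events))
print(mean([e["pt"] for e in reduced]))  # 32.5
```

In practice the selection runs a few times over the full dataset in batch, while the compute/statistics steps iterate many times interactively over the reduced sample.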
Getting ready for April '07

CMS is engaged in an aggressive program of "data challenges" of increasing complexity. Each is focused on a given aspect; all encompass the whole data analysis process:
- Simulation, reconstruction, statistical analysis
- Organized production, end-user batch jobs, interactive work
Past: Data Challenge '02 - focus on High Level Trigger studies
Present: Data Challenge '04 - focus on "real-time" mission-critical tasks
Future: Data Challenge '06 - focus on distributed physics analysis
HLT Production 2002

Focused on High Level Trigger studies:
- 6 M events = 150 physics channels
- 19,000 files = 500 event collections = 20 TB (NoPU: 2.5 M; 2x10^33 PU: 4.4 M; 10^34 PU: 3.8 M; filter: 2.9 M)
- 100,000 jobs, 45 years of CPU (wall clock)
- 11 Regional Centers: > 20 sites in USA, Europe, Russia; ~1000 CPUs
- More than 10 TB traveled on the WAN
- More than 100 physicists involved in the final analysis
Tools: GEANT3, Objectivity, PAW, ROOT; CMS Object Reconstruction & Analysis Framework (COBRA) and applications (ORCA).
Successful validation of the CMS High Level Trigger algorithms: rejection factors, computing performance, reconstruction framework.
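A rough consistency check of the production numbers above (illustrative arithmetic only; dividing the total wall-clock CPU time evenly across 1000 CPUs assumes full parallelism, which is an idealization):

```python
# Scale check of the HLT Production 2002 numbers from the slide.
cpu_years = 45        # total wall-clock CPU time
jobs = 100_000        # number of jobs
cpus = 1000           # ~1000 CPUs across the regional centers

seconds_per_year = 365 * 24 * 3600
avg_job_hours = cpu_years * seconds_per_year / jobs / 3600   # average time per job
calendar_days = cpu_years * seconds_per_year / cpus / 86400  # idealized elapsed time
print(round(avg_job_hours, 1), round(calendar_days, 1))  # 3.9 h/job, ~16 days
```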
Data Challenge 2004

DC04, to be completed in April 2004:
- Reconstruct 50 million events
- Cope with 25 Hz at 2x10^33 cm^-2 s^-1 for 1 month
- These are supposed to be events in the Tier-0 center, i.e. events passing the HLT. From the computing point of view the test is the same if these events are simple minimum bias; this is a great opportunity to reconstruct events which can be used for full analysis (Physics TDR).
Define and validate datasets for analysis:
- Identify the reconstruction and analysis objects each group would like to have for the full analysis
- Develop the selection algorithms necessary to obtain the required samples
Prepare for "mission critical" analysis: test the event model; look at calibration and alignment.
Physics and computing validation of the Geant4 detector simulation.
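As a quick sanity check of the DC04 goal stated above (illustrative arithmetic, not from the slide), 50 million events at the 25 Hz Tier-0 rate correspond to roughly one month of continuous running:

```python
# How long 50 M events take at a sustained 25 Hz.
events = 50_000_000
rate_hz = 25
days = events / rate_hz / 86400  # 86400 seconds per day
print(round(days, 1))  # ~23 days of continuous 25 Hz running, i.e. about one month
```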
How data analysis begins

The result of the reconstruction will be saved along with the raw data in an object database.
[Diagram: data flow from online to offline. On the online side, the HLT runs on Filter Units (FU), Server Units (SU) and Processing Units (PU), with monitoring and calibration alongside. "Express lines" feed the offline side - reconstruction, reprocessing and analysis - with a latency of minutes to hours.]
Data Challenge 2004

[Diagram: the DC04 processing chain. Event generation (PYTHIA) produces MC ntuples; detector simulation (OSCAR) produces detector hits; digitization (ORCA) merges in minimum-bias (MB) pileup per bunch crossing (bx) to produce Digis, i.e. raw data; reconstruction with L1 and HLT (ORCA) produces DSTs; DST stripping (ORCA), using calibration, feeds the physics streams (b/tau, e/gamma, JetMet); analysis (IGUANA/ROOT/PAW) runs over ntuples containing MC info, tracks, etc.]
High granularity "DAG"

[Diagram: the reconstruction DAG, fed by DAQ or simulation (TkHits and CaloHits, with random-number input). CaloDataFrames and TkDigis are turned into CaloRecHits and TkRecHits using conditions data (calibration Calib-A, alignment Align-C); these become CaloClusters and TkTracks (with cuts such as r < r_cut), which the JetReconstructor combines into Jets.]
Calibrations and detailed detector and physics studies require access to only a few objects per event. These studies will also need access to the "conditions" data associated with those objects. The access pattern to the very same object may be very different for different use cases: a flexible definition of "datasets" (associated with use cases) is required.
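On-demand navigation of such a DAG can be sketched as lazily evaluated nodes: each product is computed only when requested, pulling its inputs first, and cached so different use cases touching the same object share one computation. The node names below come from the diagram; the caching logic is a hypothetical illustration, not the COBRA implementation:

```python
# Sketch of lazy, on-demand evaluation in a reconstruction DAG.
class Node:
    def __init__(self, name, func, inputs=()):
        self.name, self.func, self.inputs = name, func, inputs
        self._cache = None  # product is materialized at most once

    def get(self):
        if self._cache is None:
            # Pull inputs recursively, then compute and cache this product.
            self._cache = self.func(*(n.get() for n in self.inputs))
        return self._cache

# Toy chain following one branch of the diagram: TkDigis -> TkRecHits -> TkTracks.
tk_digis = Node("TkDigis", lambda: [1, 2, 3])
tk_rechits = Node("TkRecHits", lambda digis: [d * 10 for d in digis], (tk_digis,))
tk_tracks = Node("TkTracks", lambda hits: [h for h in hits if h > 10], (tk_rechits,))
print(tk_tracks.get())  # [20, 30]
```

A calibration study that asks only for TkRecHits would stop partway down the same graph, which is why a fine-grained DAG supports very different access patterns to the same objects.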
Analysis Environments

Real-time event filtering and monitoring:
- Data-driven pipeline
- High reliability
Pre-emptive simulation, reconstruction and event classification:
- Massively parallel batch-sequential processing
- Excellent error-recovery and rollback mechanisms
- Excellent scheduling and bookkeeping systems
Interactive statistical analysis:
- Rapid Application Development environment
- Excellent visualization and browsing tools
- Human-"readable" navigation
Three Computing Environments: Different Challenges

Centralized quasi-online processing:
- Keep up with the rate
- Validate and distribute data efficiently
Distributed organized processing:
- Automation
Interactive chaotic analysis:
- Efficient access to data and "metadata"
- Management of "private" data
- Rapid Application Development
The Ultimate Challenge: A Coherent Analysis Environment

Beyond the interactive analysis tool (user point of view):
- Data analysis & presentation: n-tuples, histograms, fitting, plotting, ...
A great range of other activities with fuzzy boundaries (developer point of view):
- Batch and interactive work, from "point-and-click" to Emacs-like power tools to scripting
- Setting up configuration-management tools, application frameworks and reconstruction packages
- Data store operations: replicating entire data stores; copying runs, events and event parts between stores; not just copying but also doing something more complicated (filtering, reconstruction, analysis, ...)
- Browsing data stores down to object detail level
- 2D and 3D visualisation
- Moving code across final analysis, reconstruction and triggers
Today this involves (too) many tools.
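The "copy with something more complicated" data-store operation mentioned above can be sketched in a few lines; the store representation and function names here are hypothetical illustrations, not an API of the CMS software:

```python
# Sketch of a data-store copy that also filters and reprocesses events.
# Stores are modeled as plain lists of event dicts (a hypothetical stand-in).

def copy_events(source, dest, predicate=lambda e: True, transform=lambda e: e):
    """Copy events between stores, optionally filtering and transforming them."""
    for event in source:
        if predicate(event):
            dest.append(transform(event))
    return dest

store_a = [{"run": 1, "ntracks": 3}, {"run": 1, "ntracks": 0}, {"run": 2, "ntracks": 5}]
# Replicate only events with reconstructed tracks into a second store.
store_b = copy_events(store_a, [], predicate=lambda e: e["ntracks"] > 0)
print(len(store_b))  # 2
```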
Architecture Overview

[Diagram: a consistent user interface (federation wizards, detector/event display, data browser, analysis job wizards, generic analysis tools) sits on a coherent set of basic tools and mechanisms. CMS tools (ORCA, OSCAR, FAMOS, COBRA) and LCG tools build on the Grid and on a distributed data store and computing infrastructure, alongside software development and installation.]
Simulation, Reconstruction & Analysis Software System

[Diagram: a layered software system. A generic application framework with basic services hosts the physics modules (reconstruction algorithms, event filter, data monitoring, physics analysis) and manages event objects, calibration objects and configuration objects. Adapters and extensions bind in an extension toolkit built on object persistency, Geant3/4, CLHEP, analysis tools and the C++ standard library. A specific framework layer makes the application Grid-enabled, with Grid-aware data products uploadable on the Grid (LCG).]
Varied Components and Data Flows, One Portal

[Diagram: the user's local analysis tool (IGUANA/ROOT/...) or web browser talks through query web services and a tool plugin module to data-extraction web services. Behind them sit the production system and data repositories (Tier 0/1/2); ORCA analysis farms, or a distributed "farm" using grid queues; RDBMS-based data warehouses and the TAG/AOD extraction, conversion and transport services (Tier 1/2); and PIAF/PROOF-type analysis farms with local disk (Tier 3/4/5). Production data flows, TAG/AOD data flows and physics-query flows link the layers.]
CLARENS: a Portal to the Grid

Grid-enabling the working environment for physicists' data analysis. Clarens consists of a server communicating with various clients via the commodity XML-RPC protocol; this ensures implementation independence.
[Diagram: clients make RPC calls over http/https to a web server hosting the Clarens service.]
The server will provide a remote API to Grid tools:
- The Virtual Data Toolkit: object collection access
- Data movement between Tier centres using GSI-FTP
- CMS analysis software (ORCA/COBRA)
- Security services provided by the Grid (GSI); no Globus needed on the client side, only a certificate
The current prototype is running on the Caltech proto-Tier2.
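Because Clarens speaks plain XML-RPC over http/https, any commodity client library can talk to it. A minimal sketch using Python's standard xmlrpc.client follows; the server URL and the remote method name are hypothetical placeholders, not the actual Clarens API:

```python
# Sketch of a Clarens-style client: commodity XML-RPC, no Grid middleware needed
# on the client side. URL and method name below are hypothetical.
import xmlrpc.client

# What actually travels on the wire: a plain XML-RPC request body.
request_body = xmlrpc.client.dumps(("/store",), methodname="file.ls")

def list_collections(server_url):
    # ServerProxy transparently turns attribute calls into XML-RPC requests.
    proxy = xmlrpc.client.ServerProxy(server_url)
    return proxy.file.ls("/store")  # hypothetical remote method
```

This implementation independence is the point of the design: the same server can serve a ROOT plugin, a Python script, or a web browser, as long as each speaks XML-RPC.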
Example of Analysis on the Grid

[Diagram: the local analysis environment (data cache, browser, presenter, possibly a resource broker) talks to a remote web service - a web server hosting Clarens services - which acts as a gateway between users and the remote facility, and to a remote batch service handling resource allocation, control and monitoring.]
Summary

The success of analysis software will be measured by its ability to provide, at the same time, a simple, coherent and stable view to the physicists while retaining the flexibility required to achieve maximal computing efficiency.
CMS is responding to this challenge by developing an analysis software architecture based on a layered structure:
- A consistent interface to the physicist: customizable; implemented in many flavors (Qt, Python, ROOT, web browser)
- A flexible application framework: mainly responsible for managing event data with high granularity
- A set of back-end services: specialized for different use cases and computing environments