CHEP '03, March 24, 2003
Vincenzo Innocente, CERN/EP
http://cmsdoc.cern.ch/cmsoo/cmsoo.html
CMS Data Analysis: Present Status, Future Strategies


Page 1:

CHEP '03

March 24, 2003, Vincenzo Innocente, CERN/EP
http://cmsdoc.cern.ch/cmsoo/cmsoo.html

CMS Data Analysis: Present Status, Future Strategies

Vincenzo Innocente

CERN/EP

Page 2:


Analysis

Analysis is not just using a tool to plot a histogram, but the full chain from accessing event data up to producing the final plot for publication.

Analysis is an iterative process:
– Reduce data samples to more interesting subsets (selection)
– Compute higher-level information
– Calculate statistical quantities

Several steps (sketched in code below):
– Run the analysis job on the full dataset (a few times)
– Use an interactive analysis tool to run several times on the reduced dataset and make plots

We are still at an early stage of defining an Analysis Model:
– Today we work with raw data
– Reconstruction and analysis are mixed up (analysis and debugging are mixed up!)
– Software development, production and analysis run in parallel
– No clear concept of high-level persistent objects (DST)
– Each physics group has its own analysis package and "standard ntuple"

CMS is a laboratory for experimenting with analysis solutions.
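A minimal sketch of this two-step pattern, assuming a simple JSON-serializable event structure; the field names (n_jets, dijet_mass, leading_jet_pt) and file names are illustrative only, not CMS data formats:

```python
import json

def select(event):
    """Selection cut applied in the batch pass (illustrative thresholds)."""
    return event["n_jets"] >= 2 and event["leading_jet_pt"] > 40.0

def batch_reduce(full_dataset, ntuple_path):
    """Run once (or a few times) over the full dataset, writing a reduced 'ntuple'."""
    reduced = [
        {"mass": e["dijet_mass"], "pt": e["leading_jet_pt"]}
        for e in full_dataset if select(e)
    ]
    with open(ntuple_path, "w") as f:
        json.dump(reduced, f)

def interactive_pass(ntuple_path, pt_cut):
    """Re-run many times on the reduced sample while tuning cuts and making plots."""
    with open(ntuple_path) as f:
        reduced = json.load(f)
    return [e["mass"] for e in reduced if e["pt"] > pt_cut]
```

The point of the split is that the expensive pass over the full dataset happens rarely, while the cut-tuning and plotting loop runs on the small reduced sample.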

Page 3:


Getting Ready for April '07

CMS is engaged in an aggressive program of "data challenges" of increasing complexity.

Each focuses on a given aspect, but all encompass the whole data analysis process:
– Simulation, reconstruction, statistical analysis
– Organized production, end-user batch jobs, interactive work

Past: Data Challenge '02, focused on High Level Trigger studies
Present: Data Challenge '04, focused on "real-time" mission-critical tasks
Future: Data Challenge '06, focused on distributed physics analysis

Page 4:


HLT Production 2002

Focused on High Level Trigger studies:
– 6 M events in 150 physics channels
– 19,000 files, 500 event collections, 20 TB
– No pile-up: 2.5 M; 2×10^33 pile-up: 4.4 M; 10^34 pile-up: 3.8 M; filtered: 2.9 M
– 100,000 jobs, 45 years of CPU time (wall-clock)
– 11 Regional Centers: more than 20 sites in the USA, Europe and Russia, ~1000 CPUs
– More than 10 TB traveled over the WAN
– More than 100 physicists involved in the final analysis

Tools: GEANT3, Objectivity, PAW, ROOT, the CMS object reconstruction & analysis framework (COBRA) and its applications (ORCA).

Successful validation of the CMS High Level Trigger algorithms: rejection factors, computing performance, reconstruction framework.

Page 5:


Data Challenge 2004

DC04, to be completed in April 2004:
– Reconstruct 50 million events
– Cope with 25 Hz at 2×10^33 cm^-2 s^-1 for one month
– These are supposed to be events in the Tier-0 center, i.e. events passing the HLT
– From the computing point of view the test is the same if these events are simple minimum bias
– This is a great opportunity to reconstruct events which can be used for full analysis (Physics TDR)

Define and validate datasets for analysis:
– Identify the reconstruction and analysis objects each group would like to have for the full analysis
– Develop the selection algorithms necessary to obtain the required samples

Prepare for "mission-critical" analysis: test the event model, look at calibration and alignment.

Physics and computing validation of the Geant4 detector simulation.

Page 6:


How Data Analysis Begins

The result of the reconstruction will be saved along with the raw data in an object database.

[Figure: online/offline data flow, from the HLT farm (FU filter units, SU server units, PU processing units) through monitoring, calibration and express lines to offline reconstruction, reprocessing and analysis, with a latency of minutes to hours.]

Page 7:


Data Challenge 2004

[Figure: the DC04 production and analysis chain. Event generation (PYTHIA) feeds detector simulation (OSCAR), producing detector hits; digitization (ORCA) mixes in minimum-bias pile-up per bunch crossing to produce Digis (raw data); reconstruction with L1 and HLT (ORCA) produces DSTs; DST stripping (ORCA), using calibration, produces the b/τ, e/γ and JetMet streams; analysis (IGUANA/ROOT/PAW) runs on the stripped DSTs and on ntuples with MC information, tracks, etc.]
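To make the chain concrete, here is a schematic Python sketch of the stages in the figure; every function body is an illustrative placeholder for the corresponding CMS application (PYTHIA, OSCAR, ORCA, IGUANA/ROOT/PAW), not their actual interfaces.

```python
def generate_events(n):
    """Event generation (PYTHIA): returns generator-level events."""
    return [{"id": i, "gen": "pythia"} for i in range(n)]

def simulate(events):
    """Detector simulation (OSCAR/Geant4): attach detector hits."""
    return [dict(e, hits=["hit"]) for e in events]

def digitize(events):
    """Digitization (ORCA): produce Digis (raw data), with pile-up mixing."""
    return [dict(e, digis=["digi"]) for e in events]

def reconstruct(events):
    """Reconstruction, L1 and HLT (ORCA): produce DST-level objects."""
    return [dict(e, dst={"tracks": [], "jets": []}) for e in events]

def strip_dst(events, selection):
    """DST stripping (ORCA): keep only events passing a group's selection."""
    return [e for e in events if selection(e)]

# Analysis (IGUANA/ROOT/PAW) then runs on the stripped DSTs and MC ntuples.
chain = strip_dst(reconstruct(digitize(simulate(generate_events(100)))),
                  selection=lambda e: True)
```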

Page 8:


High-Granularity "DAG"

[Figure: a high-granularity reconstruction DAG. Starting from DAQ or simulation (CaloHits, TkHits, random numbers), CaloDataFrames and TkDigis are produced; from these, CaloRecHits and TkRecHits (using conditions such as Calib-A and Align-C); then CaloClusters and TkTracks, which a JetReconstructor combines into Jets subject to cuts (e.g. r < r_cut).]

Calibrations and detailed detector and physics studies require access to only a few objects per event. These studies will also need access to the "conditions" data associated with those objects.

The access pattern to the very same object may be very different for different use cases: a flexible definition of "datasets" (associated with use cases) is required, as in the sketch below.
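A minimal sketch of what such on-demand, high-granularity access could look like; the class and field names (ConditionsDB, digis, gain) are illustrative assumptions, not the COBRA/ORCA API.

```python
class ConditionsDB:
    """Hypothetical lookup of the conditions data valid for a given run."""
    def calibration_for(self, run):
        return {"gain": 1.02}

class Event:
    def __init__(self, run, digis, conditions):
        self.run = run
        self._digis = digis
        self._conditions = conditions
        self._cache = {}

    def rec_hits(self):
        """Build (and cache) RecHits only when a use case actually asks for them."""
        if "rechits" not in self._cache:
            calib = self._conditions.calibration_for(self.run)
            self._cache["rechits"] = [d * calib["gain"] for d in self._digis]
        return self._cache["rechits"]

# A "dataset" for a calibration study touches only the objects it needs:
conditions = ConditionsDB()
events = [Event(run=1, digis=[1.0, 2.0], conditions=conditions)]
for ev in events:
    hits = ev.rec_hits()   # low-level access; clusters, tracks, jets are never built
```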

Page 9:


Analysis Environments

Real-time event filtering and monitoring:
– Data-driven pipeline
– High reliability

Pre-emptive simulation, reconstruction and event classification:
– Massively parallel batch-sequential processing
– Excellent error recovery and rollback mechanisms
– Excellent scheduling and bookkeeping systems

Interactive statistical analysis:
– Rapid Application Development environment
– Excellent visualization and browsing tools
– Human-"readable" navigation

Page 10:


Three Computing Environments: Different Challenges

Centralized quasi-online processing:
– Keep up with the rate
– Validate and distribute data efficiently

Distributed organized processing:
– Automation

Interactive chaotic analysis:
– Efficient access to data and "metadata"
– Management of "private" data
– Rapid Application Development

Page 11:


The Ultimate Challenge: A Coherent Analysis Environment

Beyond the interactive analysis tool (the user's point of view):
– Data analysis & presentation: n-tuples, histograms, fitting, plotting, ...

A great range of other activities with fuzzy boundaries (the developer's point of view):
– Batch and interactive work, from "pointy-clicky" to Emacs-like power tools to scripting
– Setting up configuration management tools, application frameworks and reconstruction packages
– Data store operations: replicating entire data stores; copying runs, events and event parts between stores; not just copying but also doing something more complicated such as filtering, reconstruction or analysis (a sketch follows below)
– Browsing data stores down to the object detail level
– 2D and 3D visualisation
– Moving code across final analysis, reconstruction and triggers

Today this involves (too) many tools.
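A minimal sketch of such a "copy plus filter/transform" data store operation; the store interface here (plain Python iterables and lists) is an illustrative stand-in, not an actual CMS or LCG API.

```python
def copy_events(source, destination, predicate=None, transform=None):
    """Copy events between stores, optionally filtering and reprocessing them."""
    copied = 0
    for event in source:                    # any iterable of events
        if predicate and not predicate(event):
            continue                        # filtering while copying
        if transform:
            event = transform(event)        # e.g. re-reconstruction or slimming
        destination.append(event)
        copied += 1
    return copied

# Example: replicate only di-muon candidates, keeping just the event parts
# an analysis group needs (all field names are illustrative).
src = [{"run": 1, "muons": 2, "raw": "..."}, {"run": 1, "muons": 0, "raw": "..."}]
dst = []
copy_events(src, dst,
            predicate=lambda e: e["muons"] >= 2,
            transform=lambda e: {"run": e["run"], "muons": e["muons"]})
```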

Page 12:


Architecture Overview

[Figure: the architecture layers. A consistent user interface (federation wizards, detector/event display, data browser, analysis job wizards, generic analysis tools) sits on a coherent set of basic tools and mechanisms provided by the CMS tools (COBRA, ORCA, OSCAR, FAMOS) and the LCG tools, which in turn rest on the Grid and the distributed data store & computing infrastructure. Software development and installation span all layers.]

Page 13:


Simulation, Reconstruction & Analysis Software System

[Figure: the layered software system. Specific frameworks (reconstruction algorithms, event filter, physics analysis, data monitoring) and physics modules sit on a generic application framework built from basic services, adapters and extensions. Calibration, event and configuration objects are handled through object persistency; the foundation includes Geant3/4, CLHEP, analysis tools, the C++ standard library and an extension toolkit. A Grid-enabled application framework and Grid-aware data products make the system uploadable on the Grid (LCG).]

Page 14:


Varied Components and Data Flows, One Portal

[Figure: the distributed analysis picture. The production system and data repositories (Tier 0/1/2) feed ORCA analysis farms (or a distributed "farm" using grid queues) and RDBMS-based data warehouses at Tier 1/2; TAG and AOD extraction/conversion/transport services and data-extraction web services move TAGs/AODs towards PIAF/PROOF-type analysis farms and the user's local disk (Tier 3/4/5). The user works through a local analysis tool (IGUANA/ROOT/...) or a web browser, whose tool plugin module talks to query web services; physics queries flow up the chain while production data flows down.]

Page 15:


CLARENS: a Portal to the Grid

Grid-enabling the working environment for physicists' data analysis. Clarens consists of a server communicating with various clients via the commodity XML-RPC protocol; this ensures implementation independence.

[Figure: a client talks via RPC over http/https to a web server hosting the Clarens service.]

The server will provide a remote API to Grid tools:
– The Virtual Data Toolkit: object collection access
– Data movement between Tier centres using GSI-FTP
– CMS analysis software (ORCA/COBRA)
– Security services provided by the Grid (GSI)
– No Globus needed on the client side, only a certificate

The current prototype is running on the Caltech proto-Tier2.
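As an illustration of the XML-RPC pattern, the sketch below shows how a Python client might call a Clarens-style server. The endpoint URL and the method name are assumptions, not the actual Clarens API, and the real service would also authenticate the user's Grid certificate.

```python
import xmlrpc.client

# hypothetical Clarens endpoint at a Tier-2 centre
server = xmlrpc.client.ServerProxy("https://tier2.example.org:8443/clarens")

try:
    # list the contents of a remote object collection (illustrative call)
    listing = server.file.ls("/store/dc04/dst")
    for entry in listing:
        print(entry)
except xmlrpc.client.Fault as fault:
    print("remote call failed:", fault.faultString)
```

Because XML-RPC is a commodity protocol, the same call could be issued from any language with an XML-RPC library, which is the implementation independence the slide refers to.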

Page 16:


Example of Analysis on the Grid

[Figure: a local analysis environment (data cache, browser, presenter, possibly a resource broker) talks to a remote web service, hosted on a web server running Clarens services, which acts as a gateway between users and the remote facility; behind it a remote batch service handles resource allocation, control and monitoring.]
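A sketch of the interaction in the figure, assuming a hypothetical web-service gateway exposing submit/status/output methods over XML-RPC; all names and parameters here are illustrative, not an existing service API.

```python
import time
import xmlrpc.client

# hypothetical gateway between the local analysis environment and the remote facility
gateway = xmlrpc.client.ServerProxy("https://gateway.example.org/analysis")

# hypothetical submission: dataset name, analysis executable, requested resources
job_id = gateway.batch.submit({"dataset": "dc04-dst-sample",
                               "executable": "myAnalysis",
                               "cpus": 10})

# poll the monitoring interface until the remote batch job finishes (illustrative)
while gateway.batch.status(job_id) not in ("done", "failed"):
    time.sleep(60)

# fetch the reduced output (e.g. an ntuple) into the local data cache
ntuple = gateway.batch.output(job_id)
```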

Page 17:


Summary

The success of analysis software will be measured by its ability to provide, at the same time, a simple, coherent and stable view to the physicists while retaining the flexibility required to achieve maximal computing efficiency.

CMS is responding to this challenge by developing an analysis software architecture based on a layered structure:
– A consistent interface to the physicist, customizable and implemented in many flavors (Qt, Python, ROOT, web browser)
– A flexible application framework, mainly responsible for managing event data with high granularity
– A set of back-end services, specialized for different use cases and computing environments