parallel interactive and batch hep-data analysis with proof

26
1 Marek Biskup ACAT2005 PROOF Parallel Interactive and Batch HEP-Data Analysis with PROOF Maarten Ballintijn*, Marek Biskup **, Rene Brun**, Philippe Canal***, Derek Feichtinger****, Gerardo Ganis**, Guenter Kickinger**, Andreas Peters**, Fons Rademakers** * - MIT ** - CERN *** - FNAL **** - PSI

Upload: karen-harding

Post on 03-Jan-2016

40 views

Category:

Documents


2 download

DESCRIPTION

Parallel Interactive and Batch HEP-Data Analysis with PROOF Maarten Ballintijn*, Marek Biskup **, Rene Brun**, Philippe Canal***, Derek Feichtinger****, Gerardo Ganis**, Guenter Kickinger**, Andreas Peters**, Fons Rademakers**. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

1 Marek Biskup ACAT2005PROOF

Parallel Interactive and Batch HEP-Data Analysis

with PROOF

Maarten Ballintijn*, Marek Biskup**, Rene Brun**, Philippe Canal***,

Derek Feichtinger****, Gerardo Ganis**, Guenter Kickinger**, Andreas Peters**,

Fons Rademakers**

* - MIT ** - CERN *** - FNAL **** - PSI

Page 2: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

2 Marek Biskup ACAT2005PROOF

Outline

■ Data analysis model of ROOT

■ Overview of PROOF

■ Recent developments

■ Future plans - Interactive-Batch data analysis

Page 3: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

3 Marek Biskup ACAT2005PROOF

ROOT Trees

■ Tree – main data structure of ROOT■ Set of records (entries)■ Record may contain basic C types (int, double, arrays) and

any C++ object, polymorphic object, collection, stl collection, etc, e.g.:

stl::list<TrackClass> tracks; Electrons

➔ Int_t NoElectrons;➔ Double_t Momentum[NoElectrons][4];➔ Float_t Position[NoElectrons][4];

Muons➔ Int_t NoMuons;➔ …

■ Provide efficient access to partial entry data■ Typical size < 2GB

Page 4: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

4 Marek Biskup ACAT2005PROOF

Trees in memory and in files

Each Leaf is an object (c++ object, array, basic type). Each Branch groups several Leafs/Branches.

0123456789101112131415161718

T.Fill()

T.GetEntry(6)

T

Memory

Branch

Leaf

Page 5: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

5 Marek Biskup ACAT2005PROOF

Tree data storage on disk

Tree Header describing data structure

Buffer 1

Buffer 2

…Branch data split into compressed

Buffers

Page 6: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

6 Marek Biskup ACAT2005PROOF

ROOT Trees - GUI

8 leaves of a branchnamed ‘Electrons’

Double-clickto histogram

a leaf

8 Branches of tree named ‘T’

Page 7: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

7 Marek Biskup ACAT2005PROOF

Tree Friends

0123456789101112131415161718

0123456789101112131415161718

0123456789101112131415161718

Public

read

Public

read

User

Write

Entry # 8

Behave in exactly the same way as a single Tree!

Page 8: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

8 Marek Biskup ACAT2005PROOF

ROOT Chains

■ A typical Tree: < 2GB – you can process it on your laptop■ Chain – list of trees

e.g. 1000 files – the processing takes long time!

File 1, entries

0 - 800 File 2, entries

801 - 1340File 3,

entries 1341 - 2000

. . .

Behave in exactly the

same way as a single Tree!

Page 9: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

9 Marek Biskup ACAT2005PROOF

Tree Viewer

Drag and drop

variables to create

expressions

And click the Draw button

Page 10: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

10 Marek Biskup ACAT2005PROOF

Chain.Draw()chain.Draw() is a function called by the GUI for drawing

chain.Draw( “sumetc:nevent:nrun”, “”, “col”);

chain.Draw( “nevent:nrun”, “”, “lego”);

Page 11: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

11 Marek Biskup ACAT2005PROOF

Advanced data processing

■ Preprocessing and initialization■ Processing each entry (loop over

all files and entries in each file)■ Post processing and clean-up

Do not assume anything about the

order in which Process() is called

for different entries!

Selectors contain only the functions important for processing

We read only one branch

Page 12: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

12 Marek Biskup ACAT2005PROOF

ROOT Analysis Model

Client

Local file

Remote file(dCache, Castor,

RFIO, Chirp)

Rootd/xrootdserver

ROOT standard model Files analyzed on a local computer

Remote data accessed via remote fileserver (rootd/xrootd)

Page 13: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

13 Marek Biskup ACAT2005PROOF

Data transfer takes time.

PROOF

Bring the KiloBytes to the PetaBytes and not the PetaBytes to the KiloBytes

■ Parallel interactive analysis of ROOT Data■ Using the same ROOT Selectors (transparency!)■ Execution on clusters of heterogeneous computers

(scalability!)

Normal Laptop/PC can process up to 10MB/s. Current experiments and LHC need much more

Page 14: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

14 Marek Biskup ACAT2005PROOF

PROOF Basic Architecture

Slaves

ClientMaster Files

Commands, scripts

Histograms, plots

Single-Cluster mode The Master divides the work among the

slaves

After the processing finishes, merges the results (histograms, scatter plots)

And returns the result to the Client

Page 15: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

15 Marek Biskup ACAT2005PROOF

Workflow for tree analysis

Initialization

Process

Process

Process

Process

Wait for nextcommand

Slave 1Process(“ana.C”)

Pac

ket

gen

erat

or

Initialization

Process

Process

Process

Process

Wait for nextcommand

Slave NMaster

GetNextPacket()

GetNextPacket()

GetNextPacket()

GetNextPacket()

GetNextPacket()

GetNextPacket()

GetNextPacket()

GetNextPacket()

SendObject(histo)SendObject(histo)

Addhistograms

Returnresults

0,100

200,100

340,100

490,100

100,100

300,40

440,50

590,60

Process(“ana.C”)

Page 16: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

16 Marek Biskup ACAT2005PROOF

PROOF and Selectors

No user’s control of the entries loop!

Many Trees are

being processed

Initialize each slave

The code is shipped to each slave and SlaveBegin(), Init(), Process(), SlaveTerminate() are executed there

The same code works also without PROOF.

Page 17: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

17 Marek Biskup ACAT2005PROOF

PROOF Sequential mode

ClientMaster Files

Commands, scripts

Histograms, plots, canvases

The Master executes scripts (Selectors) and returns results to the Client

Canvases will be fetched from the Master automatically

Pseudo-remote desktop (better than XWindow for WAN)

From the users point of view it

works in the same way as the

standard proof mode

The canvas is automatically displayed after the processing has finished

Executes the selector and creates an off-screen canvas

Page 18: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

18 Marek Biskup ACAT2005PROOF

PROOF – Drawing a histogram

Chains may be also created automatically by a query to a grid catalog

Page 19: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

19 Marek Biskup ACAT2005PROOF

GUI and real time feedback

Feedback histogram,

updated every (e.g.) 1 second

Chain definition (header) is fetched

from the PROOF master

Page 20: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

20 Marek Biskup ACAT2005PROOF

Current Limitations of PROOF

Processing blocks the client

Permanent connection to the master.

No dynamic usage of the GRID.

■ Intended for interactive usage: Typical queries time – several minutes.

■ Designed to work on a local cluster with static configuration.

Originally:

Page 21: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

21 Marek Biskup ACAT2005PROOF

Typical QueriesInteractive/Batch queries

GUI

Commands

scriptsBatch

connected

connected ordisconnected

disconnected

Page 22: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

22 Marek Biskup ACAT2005PROOF

Analysis session snapshot

What are planning to implement:

AQ1: 1s query produces a local histogram

AQ2: a 10mn query submitted to PROOF1

AQ3->AQ7: short queries

AQ8: a 10h query submitted to PROOF2BQ1: browse results of AQ2

BQ2: browse temporary results of AQ8

BQ3->BQ6: submit 4 10mn queries to PROOF1

CQ1: Browse results of AQ8, BQ3->BQ6

Monday at 10h15

ROOT sessionOn my laptop

Monday at 16h25

ROOT sessionOn my laptop

Wednesday at 8h40

sessionon any web

browser

Page 23: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

23 Marek Biskup ACAT2005PROOF

Planned features

■ Session disconnect and reconnect

■ Asynchronous queries

■ Start-up of slaves via Grid job scheduler

■ Allow slaves to join/leave the computation

■ Slaves calling out to master (firewalls)

Page 24: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

24 Marek Biskup ACAT2005PROOF

PROOF on the Grid

PROOFPROOF

USER SESSIONUSER SESSION

PROOF PROOF SLAVE SLAVE SERVERSSERVERS

PROOF MASTERPROOF MASTER SERVERSERVER

PROOF PROOF SLAVE SLAVE SERVERSSERVERS

PROOF PROOF SLAVE SLAVE SERVERSSERVERS

Guaranteed site access throughPROOF Sub-Masters calling outto Master (agent technology)

PROOF SUB-PROOF SUB-MASTERMASTER SERVERS SERVERS

PROOFPROOF

PROOFPROOF

PROOFPROOF

Grid/Root Authentication

Grid Access Control Service

TGrid UI/Queue UI

Proofd Startup

Grid Service Interfaces

Grid File/Metadata Catalogue

Client retrieves listof logical files (LFN + MSN)

Page 25: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

25 Marek Biskup ACAT2005PROOF

Summary

■ ROOT is a powerful analysis framework with very efficient data storage mechanisms.

■ PROOF works well for interactive parallel ROOT data analysis on a local cluster

Fully integrated with ROOT – you can use chains with PROOF in the same way as locally.

You can use the same Selectors you’ve written for local processing.

But it was designed for short-duration interactive queries.■ PROOF is evolving: we plan to accommodate longer

running queries. Disconnect from and reconnect to a running query. Non-Blocking queries. Dynamic configuration (using the GRID).

Page 26: Parallel Interactive and Batch  HEP-Data Analysis  with PROOF

26 Marek Biskup ACAT2005PROOF

Questions