lcgcaf:cdf submission portal to lcg

15
November 11-17 SC06 Tampa LcgCAF:CDF submission portal to LcgCAF:CDF submission portal to LCG LCG Federica Fanzago Federica Fanzago for for CDF-Italian Computing Group CDF-Italian Computing Group Gabriele Compostella, Francesco Delli Paoli, Gabriele Compostella, Francesco Delli Paoli, Donatella Lucchesi, Daniel Jeans, Subir Sarkar Donatella Lucchesi, Daniel Jeans, Subir Sarkar and Igor Sfiligoi and Igor Sfiligoi

Upload: edie

Post on 25-Jan-2016

37 views

Category:

Documents


0 download

DESCRIPTION

LcgCAF:CDF submission portal to LCG. Federica Fanzago for CDF-Italian Computing Group Gabriele Compostella, Francesco Delli Paoli, Donatella Lucchesi, Daniel Jeans, Subir Sarkar and Igor Sfiligoi. L peak : 2.3x10 32 s -1 cm -2 1.96 TeV - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: LcgCAF:CDF submission portal to LCG

November 11-17 SC06 Tampa

LcgCAF:CDF submission portal to LCGLcgCAF:CDF submission portal to LCG

Federica FanzagoFederica Fanzagofor for

CDF-Italian Computing GroupCDF-Italian Computing GroupGabriele Compostella, Francesco Delli Paoli, Donatella Gabriele Compostella, Francesco Delli Paoli, Donatella Lucchesi, Daniel Jeans, Subir Sarkar and Igor SfiligoiLucchesi, Daniel Jeans, Subir Sarkar and Igor Sfiligoi

Page 2: LcgCAF:CDF submission portal to LCG

2 November 11-17 SC06 Tampa

CDF experiment and data volumesCDF experiment and data volumes

D0

CDF

Tevatron

Main Injector & Recycler

p source

• Lpeak: 2.3x1032s-1cm-2 1.96 TeV

• 36 bunches, 396 ns crossing

time

Chicago

Inte

gra

ted

Lu

min

osit

y (

fb-

1)

9

8

7

6

5

4

3

2

1

0

We are here today

By FY07:Double data up to FY06

By FY09:Double data up to FY07

S

~ 2

~ 5.5

~ 9

data

~ 9

~ 20

~ 23

cpu needs

30 mA/hr

25 mA/hr

20 mA/hr

15 mA/hr

9/30/03 9/30/04 9/30/05 9/30/06 9/30/07 9/30/08 9/30/09

Page 3: LcgCAF:CDF submission portal to LCG

3 November 11-17 SC06 Tampa

CDF Data Handling Model: how it wasCDF Data Handling Model: how it was

Head Node

Monitor

Mailer

Submitter User Desktop

GUI

Worker Node

CafExe

MyJob.sh

MyJob.exe

Developed by: H. Shih-Chieh, E. Lipeles, M. Neubauer, I. Sfiligoi,F. Wuerthwein

Data processing: dedicated production farm at FNAL User analysis and Monte Carlo production: CAF Central Analysis Facility. - FNAL hosts data->farm used to analyze them - Remote sites decentralized CAF (dCAF) produce Monte Carlo, some sites (ex. CNAF) have also data. CAF and dCAF are CDF dedicated farms structured:

Dedicated CAF

User tarball(Myjob.exe)

condor

Page 4: LcgCAF:CDF submission portal to LCG

4 November 11-17 SC06 Tampa

Moving to a GRID Computing ModelMoving to a GRID Computing Model

Reasons to move to a GRID computing model

Need to expand resources, luminosity expect to increase by factor ~4 until CDF ends. Resource were in dedicated pools, limited expansion and need to be maintained by CDF persons GRID resources can be fully exploited before LHC experiments will start analyze data CDF has no hope of have a large amount of resources without GRID!

LcgCAFLcgCAF

Page 5: LcgCAF:CDF submission portal to LCG

5 November 11-17 SC06 Tampa

LcgCAF: General ArchitectureLcgCAF: General Architecture

CDF-SE(SE of GRID

site)

cdf-ui~

cdf-head

…Robotic tape Storage (FNAL)

GRID site

User DesktopGUI job

CDF user musthave access toCAF clients andSpecific CDFcertificate

ou

tpu

t

output

…job

…GRID site

Page 6: LcgCAF:CDF submission portal to LCG

6 November 11-17 SC06 Tampa

Job Submission and ExecutionJob Submission and Execution

Monitor

Submitter Mailer

Job …

WMS

Grid site

Grid site

cdf-ui~

cdf-head

Daemons running on CDF head:Submitter: receive user job, create its wrapper to run it on GRID environment, make available user tarball Monitor: provide job information to user Mailer: send e-mail to user when job is over with a

job summary (useful for user bookkeeping) Job wrapper (CafExe): run user job, copy whatever is left at the end of user code to user specified location

CafExe

Page 7: LcgCAF:CDF submission portal to LCG

7 November 11-17 SC06 Tampa

Job MonitorJob Monitor

Information System

WMS: keeps track of job status on the GRID

WN: job wrapper collects information like load dir, ps, top..

cdf-ui~

cdf-headJob information«««

WN information«««

Web based Monitor: informationon running jobs for all users and jobs/users history are read fromLcgCAF Information System

Interactive job Monitor: running jobs information are read fromLcgCAF Information System and sent to user desktop. Available commands: CafMon list CafMon kill CafMon dir CafMon tail CafMon ps CafMon top

««

«

«

Page 8: LcgCAF:CDF submission portal to LCG

8 November 11-17 SC06 Tampa

Lcg site: CNAF

Lcg site: Lyon

… …

Proxy Cache

Proxy Cache

/home/cdfsoft

/home/cdfsoft

A CDF job to run needs CDF code and Run Condition DB available to WN, since we cannot expect all the GRID sites to carry them, Parrot is used as virtual file systemfor code and FroNTier for DB. Since most request are the same, a local cache is used: Squid web proxy cache

CDF Code DistributionCDF Code Distribution

Fermilab server

User name placeholder
Page 9: LcgCAF:CDF submission portal to LCG

9 November 11-17 SC06 Tampa

…CDF-SE

Job outputWorker Nodes

…CDF-Storage

gridftp

KRB5 rcp

User job output is transferred to CDF Storage Element (SE) via gridftp or to a CDF-Storage location via rcp From CDF-SE or CDF-storage output copied using

gridftp to FNAL fileservers. Automatic procedure copies files from fileservers to tape and declare them to the CDF catalogue

CDF Output storageCDF Output storage

Page 10: LcgCAF:CDF submission portal to LCG

10 November 11-17 SC06 Tampa

LcgCAF used by power users for B Monte Carlo Production but generic users also start to use it!

Last month running sections

CDF VO monitored with standard LCG tools: GridIce

LcgCAF Monitor

1 week Italian resources

LcgCAF Usage: first GRID exerciseLcgCAF Usage: first GRID exercise

Page 11: LcgCAF:CDF submission portal to LCG

11 November 11-17 SC06 Tampa

LcgCAF UsageLcgCAF Usage

Samples Events on tape

with 1 Recovery

Bs->Ds Ds->, Ds->K*K, Ds->KsK

~6,2 · 106 ~94% ~100%

CDF B Monte Carlo production: ~5000 jobs in ~1 month

Failures: GRID: site miss-configuration, temporary authentication problems, WMS overload LcgCAF: Parrot/Squid cache stackedRetry mechanism: GRID failures WMS retry LcgCAF failures “home-made” retry (logs parsing)

User jobs: ~1000 jobs in a week ~100% executed on these sites:

GridKA

INFN-BA

INFN-CT

INFN-LNL

Page 12: LcgCAF:CDF submission portal to LCG

12 November 11-17 SC06 Tampa

o No policy mechanism on LCG, user policy rely on site rules, for LcgCAF user “random priority” Plan to implement a mechanism of policy management on top of GRID policy tools o Interface between SAM (CDF catalogue) and GRID services under development to allow:

output file declaration into catalogue directly from

LcgCAF automatic transfer of output files from local CDF storage to FNAL tape

data analysis on LCG sites

CDF has developed a portal to access the LCG resources used to produce Monte Carlo data and for every-day user jobs GRID can serve also a running experiment

Conclusions and Future developmentsConclusions and Future developments

Page 13: LcgCAF:CDF submission portal to LCG

13 November 11-17 SC06 Tampa

Page 14: LcgCAF:CDF submission portal to LCG

14 November 11-17 SC06 Tampa

Site Country KSpecInt2K

CNAF-T1 Italy 1500

INFN-Padova Italy 60

INFN-Catania Italy 248

INFN-Bari Italy 87

INFN-Legnaro Italy 230

IN2P3-CC France 500

FZK-LCG2 Germany

2000

IFAE Spain 20

PIC Spain 150

UKI-LT2-UCL-HE UK 300

Liverpool UK 100

Currently CDF access the following sites:

CDF has produced and is producing Monte Carlo data exploiting these resources demonstrating that GRID can serve also a running experiment

More details …More details …

Page 15: LcgCAF:CDF submission portal to LCG

15 November 11-17 SC06 Tampa

Samples Events on tape

with 1 Recovery

Bs->Ds Ds-> 1969964 ~92% ~100%

Bs->Ds Ds->K*K

2471751 ~98% ~100%

Bs->Ds Ds->KsK

1774336 ~92% ~100%

CDF B Monte Carlo production

#segments succeeded# segments submitted

GRID failures: site miss-configuration, temporary authentication problems, WMS overloadLcgCAF failures: Parrot/Squid cache stackedRetry mechanism: necessary to achieve ~100% efficiencyWMS retry for GRID failures and LcgCAF “home-made” retry based on logs parsing

Performances (details)Performances (details)