Distributed Monte Carlo Production for DZero

Joel Snow
Langston University

DOE Review March 2011

Outline
● Introduction
● FNAL SAM
● SAMGrid
● Interoperability with OSG and LCG
● Production System
● Production Results
● LUHEP Computing
● Summary

Introduction
● Covers my tenure as MC production coordinator
● Simulation data (MC) crucial to physics analysis
● Tevatron luminosity, and hence raw data volume, is at record levels
● Challenge for analysts and production
● Personnel & computing resources migrating to LHC experiments
● DZero strategy
– Increase automation
– Leverage resources and support

Evolution
● Mature experiment, but nimble
– history of adopting innovative technologies
    ● distributed data handling - SAM
    ● early adopter of the grid for production - SAMGrid
– significant investment in these technologies
● Grid technology allows opportunistic usage
– DZero can mix “traditional” dedicated and opportunistic resources
● Grid interoperability
– Leverages resources and support, reduces personnel needs per CPU hour

Sequential data Access via Metadata

● Fermilab system first used by DZero

● SAM distributed data handling system predates grid

● Set of servers working together to store and retrieve files and metadata

● Permanent storage and local disk caches

● Database tracks location, metadata of files, job processing history

● Delivers files to jobs (using GridFTP over WAN), provides job submission capabilities
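
As a rough illustration of the bookkeeping described above, the following Python sketch models a catalog that records file locations, metadata, and per-job delivery history. Everything in it is hypothetical: the names `FileCatalog`, `declare`, `query`, and `deliver`, and the sample file and cache names, are invented for illustration and are not SAM's actual interface or schema.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    metadata: dict                                # e.g. {"tier": "generated", "request": 101}
    locations: set = field(default_factory=set)   # storage systems or caches holding a copy


class FileCatalog:
    """Toy stand-in for a metadata database (hypothetical, not SAM's schema)."""

    def __init__(self):
        self.files = {}      # file name -> FileRecord
        self.history = []    # (job id, file name, source) for each delivery

    def declare(self, name, metadata, location):
        """Register a file, or add a new location for a known file."""
        record = self.files.setdefault(name, FileRecord(metadata))
        record.locations.add(location)

    def query(self, **criteria):
        """Return names of files whose metadata matches every criterion."""
        return sorted(
            name for name, rec in self.files.items()
            if all(rec.metadata.get(k) == v for k, v in criteria.items())
        )

    def deliver(self, job_id, name, local_cache):
        """Choose a source for the file, preferring the job's local cache."""
        rec = self.files[name]
        if local_cache in rec.locations:
            source = local_cache
        else:
            source = sorted(rec.locations)[0]     # otherwise the alphabetically first known copy
        rec.locations.add(local_cache)            # the file is now cached locally
        self.history.append((job_id, name, source))
        return source


catalog = FileCatalog()
catalog.declare("mc_0001.dat", {"tier": "generated", "request": 101}, "tape")
catalog.declare("mc_0002.dat", {"tier": "generated", "request": 102}, "tape")
first = catalog.deliver("job-1", "mc_0001.dat", "site-cache")    # comes from permanent storage
second = catalog.deliver("job-2", "mc_0001.dat", "site-cache")   # served from the local cache
```

The second delivery is served from the local cache because the first one recorded the new location, which is the point of tracking locations and history in one place.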

SAMGrid
● Fermilab-developed grid first used by DZero for global MC production in 2004

● SAMGrid = SAM + Job and Information Management (JIM) components

● Provides the user with transparent remote job submission, data processing and status monitoring.

● VDT based (Globus + Condor)

● Logically consists of

– Multiple execution sites

– Resource selector

– Multiple Job Submission (Scheduler) sites

– Multiple Clients (User Interface) to Submission site.

SAMGrid Interoperability
● As Open Science Grid (OSG) and LHC Computing Grid (LCG) became operational it was desirable to leverage these resources for DZero
● FNAL and DZero developed and deployed SAMGrid interoperability with both LCG and OSG resources
● Execution site acts as a Forwarding node
– packages SAMGrid jobs for OSG/LCG job submission via Condor-G
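
The forwarding idea can be sketched in a few lines of Python. This is only a conceptual model under stated assumptions: the classes `ExecutionSite` and `ForwardingNode`, the job dictionary, and `fake_grid_submit` (a placeholder standing in for Condor-G submission) are all hypothetical and do not reflect SAMGrid's real code.

```python
class ExecutionSite:
    """A native site that runs jobs on its own batch system (hypothetical model)."""

    def __init__(self, name):
        self.name = name

    def submit(self, job):
        return f"{self.name}: running {job['id']} locally"


class ForwardingNode(ExecutionSite):
    """Looks like an execution site to the scheduler but hands jobs to another grid."""

    def __init__(self, name, grid, submit_to_grid):
        super().__init__(name)
        self.grid = grid
        self.submit_to_grid = submit_to_grid   # callable standing in for Condor-G submission

    def submit(self, job):
        # Wrap the original job with the information the remote grid needs.
        packaged = {"grid": self.grid, "payload": job, "via": self.name}
        return self.submit_to_grid(packaged)


def fake_grid_submit(packaged):
    # Placeholder only: a real system would hand the packaged job to Condor-G here.
    return f"{packaged['grid']}: accepted {packaged['payload']['id']} via {packaged['via']}"


sites = [ExecutionSite("native-site"), ForwardingNode("fwd-osg", "OSG", fake_grid_submit)]
job = {"id": "mc-request-101-part-1"}
results = [site.submit(job) for site in sites]
```

Because the forwarding node exposes the same `submit` method as a native site, the scheduler in this sketch does not need to know which grid ultimately runs the job.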

Consolidation, Automation, Exploitation

● SAMGrid sites require operational manpower and expert support

● People power and FNAL support migrating to LHC experiments

● Increase automation - Automc
● Reduce number of SAMGrid sites, increase use of OSG and LCG
– comes with support ✔
– provides opportunistic job slots ✔

Production System
MC production gets work from the SAM Request System

Physics groups' MC requests are parametrized and prioritized as a Python object
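
A minimal sketch of what a parametrized, prioritized request object could look like in Python follows. The field names (`priority`, `request_id`, `params`), the sample physics-group labels, and the convention that a lower number means higher priority are assumptions made for illustration; the actual SAM Request System layout is not shown in these slides.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class MCRequest:
    priority: int        # lower number is served first (assumed convention)
    request_id: int      # ties on priority are broken by request id
    params: dict = field(default_factory=dict, compare=False)


class RequestQueue:
    """Priority queue of pending requests, backed by a binary heap."""

    def __init__(self):
        self._heap = []

    def add(self, request):
        heapq.heappush(self._heap, request)

    def next_request(self):
        return heapq.heappop(self._heap) if self._heap else None


queue = RequestQueue()
queue.add(MCRequest(priority=5, request_id=101, params={"group": "top", "events": 200_000}))
queue.add(MCRequest(priority=1, request_id=102, params={"group": "higgs", "events": 500_000}))
queue.add(MCRequest(priority=5, request_id=100, params={"group": "qcd", "events": 100_000}))

order = []
while (req := queue.next_request()) is not None:
    order.append(req.request_id)
```

Keeping the physics parameters in a plain dictionary and excluding them from comparison means only priority and request id decide the order in which work is handed out.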

Automatic Monte Carlo Request Processing

● Developed Automc System – in use at FNAL
– Handles official DZero MC production at all but 2 sites
● From approved request to final data storage
● Easy to use – minimizes manpower needs
● Site independent
– deploy for any grid site (SAMGrid, OSG, LCG)
– capable of managing many sites
● Handles recovery of common failures
● Integrates with existing MC request priority protocol
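
Automatic recovery from common failures is often implemented as a bounded retry around each processing step. The Python sketch below shows that general pattern only; the function `run_with_recovery`, the step names, and the simulated timeout are hypothetical and are not taken from Automc itself.

```python
def run_with_recovery(steps, max_attempts=3):
    """Run each step in order, retrying a failed step up to max_attempts times."""
    log = []
    for name, action in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                log.append((name, attempt, "ok"))
                break
            except RuntimeError as err:
                log.append((name, attempt, f"failed: {err}"))
        else:
            # Every attempt failed: stop and leave the request for an operator.
            log.append((name, max_attempts, "gave up"))
            return False, log
    return True, log


# Simulated steps: "store" fails once with a transient error, then succeeds.
calls = {"store": 0}


def generate():
    pass


def store():
    calls["store"] += 1
    if calls["store"] == 1:
        raise RuntimeError("transfer timed out")


ok, log = run_with_recovery([("generate", generate), ("store", store)])
```

In this run the request completes without operator intervention: the failed storage attempt is logged and the second attempt succeeds.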

AutoMC Monitoring
Running at FNAL & managing production at 39 sites

http://www-d0.fnal.gov/computing/mcprod/dajd/dajd_status.html

Production System Resources

● MC production uses a variety of dedicated and opportunistic resources on 4 continents
– Non-grid site at ccin2p3 Lyon (FR) – very productive, flexible
– Native Samgrid sites: FZU (CZ), GridKa (DE), LUHEP (US), USTC (CN)
– LCG resources: CE's, SE's, and Samgrid-LCG infrastructure in FR, UK, NL
– OSG resources: CE's, SE's, and Samgrid-OSG infrastructure in US

MC Production Results
Looking back at the last 30 days

Averaging 5.8M events per day and totaling 172.8M events in 30 days

MC Production Results
Looking back at the last year (2010/02/14 - 2011/02/14)

Averaging 49M events per week and totaling 2.6B events in a year

[Plot label: cumulative since September 2005]

MC Production Results
Looking back at the last year by production segment

52 week averages per week (2010/02/14 - 2011/02/14)

Non-grid: 19.8M, OSG: 11.4M, Samgrid: 12.6M, LCG: 4.9M

MC Production Results
Looking back at the last year by production segment

52 week totals (2010/02/14 - 2011/02/14)

Segment    Events   Share
Non-grid   1041M    40.8%
OSG         596M    23.3%
Samgrid     658M    25.8%
LCG         257M    10.1%

[Pie chart: Production Last Year By Segment; second plot labeled "Cumulative since September, 2005"]
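
The segment shares can be recomputed from the 52 week totals as a quick consistency check. The short Python sketch below uses only numbers quoted on the slides; the variable names are illustrative.

```python
# Events produced 2010/02/14 - 2011/02/14, in millions (from the slide).
totals_m = {"Non-grid": 1041, "OSG": 596, "Samgrid": 658, "LCG": 257}

grand_total_m = sum(totals_m.values())
shares = {seg: round(100 * n / grand_total_m, 1) for seg, n in totals_m.items()}

# Weekly averages quoted on the previous slide, in millions of events per week.
weekly_avg_m = {"Non-grid": 19.8, "OSG": 11.4, "Samgrid": 12.6, "LCG": 4.9}
weekly_sum_m = round(sum(weekly_avg_m.values()), 1)

print(grand_total_m, "M events in total")
print(shares)
print(weekly_sum_m, "M events per week summed over segments")
```

The totals sum to 2552M events, in line with the quoted 2.6B for the year, and the weekly averages sum to 48.7M, in line with the quoted 49M per week. The recomputed shares match the slide to within 0.1 percentage point; the OSG share comes out as 23.4% rather than 23.3%, which is consistent with the totals having been rounded to the nearest million.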

MC Production Geographic Distribution

Events Last Year (2010/02/14 - 2011/02/14):

Region       Events   Share
Europe       1925M    75.4%
N. America    574M    22.5%
Asia           29M     1.1%
S. America     24M     0.9%

[Pie chart: events last year by region]

MC Production Results
Looking back at the last 5.5 years

Averaging 19.2M events per week and totaling about 5.45B events

[Plot label: cumulative since September 2005 (2005/09/05 - 2011/02/14)]

MC Production Results
Looking back at the last 5.5 years by production segment

5.5 year averages per week (2005/09/05 - 2011/02/14)

Non-grid: 8.0M, OSG: 4.8M, Samgrid: 5.3M, LCG: 1.1M

MC Production Results
Looking back at the last 5.5 years by production segment

5.5 year totals (2005/09/05 - 2011/02/14)

Segment    Events   Share
Non-grid   2.26B    41.5%
OSG        1.37B    25.2%
Samgrid    1.51B    27.7%
LCG        306M      5.6%

[Pie chart titled "Production Last Year By Segment"; second plot labeled "Cumulative since September, 2005"]
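
The 5.5 year per-segment figures can be cross-checked in two ways: by summing the segment totals, and by multiplying the summed weekly averages by the number of weeks in the period. The Python sketch below does both, using only dates and numbers quoted on the slides; the variable names are illustrative.

```python
from datetime import date

# Reporting period quoted on the slides.
start, end = date(2005, 9, 5), date(2011, 2, 14)
weeks = (end - start).days / 7

# Per-segment weekly averages (millions of events) and totals (billions of events).
weekly_avg_m = {"Non-grid": 8.0, "OSG": 4.8, "Samgrid": 5.3, "LCG": 1.1}
totals_b = {"Non-grid": 2.26, "OSG": 1.37, "Samgrid": 1.51, "LCG": 0.306}

sum_of_totals_b = sum(totals_b.values())
from_weekly_b = sum(weekly_avg_m.values()) * weeks / 1000

print(f"{weeks:.1f} weeks")
print(f"sum of segment totals:  {sum_of_totals_b:.2f}B events")
print(f"weekly average x weeks: {from_weekly_b:.2f}B events")
```

The period spans 284 weeks. The segment totals sum to about 5.45B events, and 19.2M events per week over 284 weeks also gives about 5.45B, so the two sets of per-segment figures agree with each other to well under one percent.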

Production Results Last 7 Years
DZero MC Production in Millions of Events per year ending 12/26

Year   Total    Non-Grid   SAMGrid   OSG     LCG
2010   2388.5   1011.2     614.8     539.2   223.3
2009   1122.6    540.3     217.9     364.2     0.3
2008    794.8    315.6     213.6     259.7     5.8
2007    398.2    109.1     158.1      96.5    34.4
2006    348.0    144.4     195.5       0.5     7.6
2005     98.1     68.6      29.5       0.0     0.0
2004     42.4     41.8       0.6       0.0     0.0

[Chart: DZero MC Production in Millions of Events, 2004-2010, by segment (Non-Grid, SAMGrid, OSG, LCG)]

Production Results Last 7 Years
DZero MC Production in Terabytes of Data per year ending 12/26

Year   Total   Non-Grid   SAMGrid   OSG    LCG
2010   221.0   83.3       61.8      53.7   22.3
2009    95.3   42.7       19.8      32.8    0.0
2008    67.8   26.9       18.4      22.0    0.5
2007    31.6    7.3       13.2       8.2    2.9
2006    23.0    9.4       13.1       0.0    0.5
2005     6.0    4.1        1.9       0.0    0.0
2004     1.9    1.9        0.0       0.0    0.0

[Chart: DZero MC Production in Terabytes of Data, 2004-2010, by segment (Non-Grid, SAMGrid, OSG, LCG)]
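
Dividing each year's data volume by its event count gives a rough average stored size per event, and the ratio of consecutive yearly totals gives the growth rate. The Python sketch below uses the Total columns of the two yearly tables; decimal units (1 TB = 10^12 bytes) are an assumption, since the slides do not say which convention was used.

```python
# Yearly totals from the two tables: millions of events and terabytes of data.
events_m = {2004: 42.4, 2005: 98.1, 2006: 348.0, 2007: 398.2,
            2008: 794.8, 2009: 1122.6, 2010: 2388.5}
data_tb = {2004: 1.9, 2005: 6.0, 2006: 23.0, 2007: 31.6,
           2008: 67.8, 2009: 95.3, 2010: 221.0}

# TB per million events equals MB per event (1e12 / 1e6 bytes); x1000 converts to kB.
kb_per_event = {year: data_tb[year] / events_m[year] * 1000 for year in events_m}

for year in sorted(kb_per_event):
    print(year, f"{kb_per_event[year]:.0f} kB/event")

# Growth in event production from 2009 to 2010.
growth_2010 = events_m[2010] / events_m[2009]
print(f"2010 / 2009 = {growth_2010:.2f}")
```

With these inputs the 2010 figure comes out near 93 kB per event, and 2010 event production is about 2.1 times that of 2009.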

OU DZero MC Production
2005/09/05 - 2011/02/14

OUHEP produced 306 M events and 28.4 TB data

Last year OUHEP produced 139 M events and 14.0 TB data

[Plot labels: cumulative since Sept. 2005; 2010/02/14 – 2011/02/14]

LU DZero MC Production
2005/09/05 - 2011/02/14

LUHEP produced 15.5 M events and 1.36 TB data

Last year LUHEP produced 4.6 M events and 450 GB data

[Plot labels: cumulative since Sept. 2005; 2010/02/14 – 2011/02/14]

LUHEP Computing

● 2 grid-enabled clusters, both producing DØ MC

● Old Samgrid cluster - 12 job slots

● New OSG cluster - 12 job slots with small associated SE used as DØ cache

Condor Q's at LUHEP

[Plot labels: SAMGrid; OSG; Last Year]

Summary
● DZero's early deployment of grid technology and automation has dramatically increased MC production
– First deployment of the SAM distributed data handling system
– Early SAMGrid deployment
– Use of OSG and LCG resources through interoperability with SAMGrid
– First opportunistic usage of OSG Storage Elements
– Automated MC production system
● Anticipate adequate MC through the last analysis
