Computing Plans in CMS
Ian Willers
CERN
1. The Problem and Introduction
2. Data Challenge – DC04
3. Computing Fabric – Technologies evolution
4. Conclusions
The Problem
[Diagram: data flow from the detector through the event filter (selection & reconstruction) to raw data and processed data; reconstruction produces event summary data; event reprocessing and event simulation feed back into the chain; batch physics analysis extracts analysis objects by physics topic, which are used in interactive physics analysis.]
Regional Centres – a Multi-Tier Model
[Diagram: CERN is Tier 0, linked at 622 Mbps / 2.5 Gbps to Tier 1 regional centres such as FNAL, RAL and IN2P3; Tier 1 centres connect at 155 Mbps to Tier 2 centres (Lab a, Uni b, Lab c, … Uni n), which in turn serve department and desktop resources.]
6
Iterations /scenarios
Computing TDR StrategyPhysics Model
•Data model •Calibration•Reconstruction•Selection streams•Simulation•Analysis•Policy/priorities…
Computing Model
•Architecture (grid, OO,…)•Tier 0, 1, 2 centres•Networks, data handling•System/grid software•Applications, tools•Policy/priorities…
C-TDR• Computing model (& scenarios)• Specific plan for initial systems• (Non-contractual) resource planning
DC04 Data challenge
Copes with 25Hz at 2x10**33 for 1 month
TechnologiesEvaluation and
evolution Estimated AvailableResources
(no cost book for computing)
Requiredresources
SimulationsModel systems &
usage patterns
Validation of Model
1. The Problem and Introduction
2. Data Challenge – DC04
3. Proposed Computing Fabric
4. Conclusions
Data Challenge DC04
Pre-Challenge Production is starting now; the “true” DC04 runs in February 2004.
[Diagram: Pre-Challenge Production generates 50M events (75 TByte) into the CERN tape archive. A fake DAQ at CERN feeds the DC04 T0 challenge: first-pass reconstruction at 25 Hz and 1.5 MB/event (40 MByte/s, 3.2 TB/day), with raw data (25 Hz, 1 MB/event) and reconstructed DSTs (25 Hz, 0.5 MB/event) going via a disk cache to archive storage on CERN tape; a ~40 TByte CERN disk pool holds ~20 days of data. The DC04 calibration challenge sends a calibration sample to calibration jobs that update the MASTER conditions DB, replicated as conditions DB copies at the Tier 1 centres. The DC04 analysis challenge distributes event streams and TAG/AOD (20 kB/event) replicas to Tier 1 and Tier 2 centres for Higgs DST and SUSY background DST analyses, possibly behind an HLT filter; a Higgs background study requests new events from an event server.]
[Diagram (updated DC04 parameters): the same workflow, with the CERN disk pool holding ~10 days of data, first-pass reconstruction at 25 Hz and 2 MB/event (50 MByte/s, 4 TByte/day), TAG/AOD at 10-100 kB/event, and the Pre-Challenge Production (PCP) sample of 50M events (75 TByte) on the CERN tape archive.]
(The throughputs and daily volumes quoted on these two slides are simple rate × size products; a check is sketched below.)
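As a quick sanity check of the DC04 figures above, a minimal sketch of the arithmetic, using only the numbers quoted on the slides (the slides round the results slightly):

    # DC04 rate arithmetic, using the numbers quoted on the two DC04 slides.
    rate_hz = 25                        # events per second out of the fake DAQ

    for event_size_mb in (1.5, 2.0):    # event size in MB (first and updated slide)
        throughput_mb_s = rate_hz * event_size_mb        # MB/s into reconstruction
        volume_tb_day = throughput_mb_s * 86400 / 1e6    # TB written per day
        print(f"{event_size_mb} MB/evt -> {throughput_mb_s:.1f} MB/s, "
              f"{volume_tb_day:.1f} TB/day")

    # Pre-Challenge Production sample: 50M events at ~1.5 MB each -> 75 TB
    print(f"PCP sample: {50e6 * 1.5 / 1e6:.0f} TB")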
MCRunJob – Pre-Challenge Production with/without GRID
[Diagram: a physics group asks RefDB for an official dataset; the Production Manager defines assignments; a Site Manager starts an assignment, or a user starts a private production from the user's site (or a grid UI). MCRunJob then produces the jobs in whatever form the target resources understand (sketched below): shell scripts for a local batch manager, JDL for the EDG Scheduler, a DAG for DAGMan, or Chimera VDL for the Virtual Data Catalogue and Planner, running on a computer farm or on CMS/LCG-0.]
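The point of the diagram is that one production assignment can be rendered into whichever submission format the target system expects. The sketch below is purely illustrative: the job fields and the render_* helpers are hypothetical, not the actual MCRunJob interfaces, and the JDL/DAG snippets are only schematic.

    # Illustrative only: one abstract job description rendered for different
    # back ends (local batch, EDG/JDL, DAGMan). Field names are hypothetical.
    job = {"executable": "cmsim.sh", "arguments": "run=1234", "events": 250}

    def render_shell(job):
        # plain shell script for a local batch manager
        return f"#!/bin/sh\n./{job['executable']} {job['arguments']}\n"

    def render_jdl(job):
        # schematic EDG-style JDL (key = "value"; pairs)
        return (f'Executable = "{job["executable"]}";\n'
                f'Arguments  = "{job["arguments"]}";\n')

    def render_dag(jobs):
        # schematic DAGMan description: one node per job, run in sequence
        lines = [f"JOB step{i} step{i}.submit" for i in range(len(jobs))]
        lines += [f"PARENT step{i} CHILD step{i + 1}" for i in range(len(jobs) - 1)]
        return "\n".join(lines)

    print(render_shell(job))
    print(render_jdl(job))
    print(render_dag([job, job]))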
1. The Problem and Introduction
2. Data Challenge – DC04
3. Proposed Computing Fabric
4. Conclusions
HEP Computing
• High Throughput Computing
  – throughput rather than performance
  – resilience rather than ultimate reliability
  – long experience in exploiting inexpensive mass-market components
  – management of very large-scale clusters is a problem
CPU Servers
CPU capacity – Industry
• OpenLab study of 64-bit architecture
• Earth Simulator
  – number 1 computer in the Top 500
  – made in Japan by NEC
  – peak speed of 40 Tflops
  – leads the Top 500 list by almost a factor of 5
  – performance of the Earth Simulator equals the sum of the next 12 computers
  – the Earth Simulator runs at 90% efficiency (vs. 10-60% for PC farms)
  – Gordon Bell warned “Off-the-shelf supercomputing is a dead end”
Earth Simulator
Cited problems with farms used as supercomputers
• Lack of memory bandwidth
• Interconnect latency
• Lack of interconnect bandwidth
• Lack of high-performance (parallel) I/O
• High cost of ownership for large-scale systems
• For CMS – does this matter?
LCG Testbed Structure
[Diagram: the testbed used 100 CPU servers on GE and 300 on FE, 100 disk servers on GE (~50 TB) and 20 tape servers on GE. Backbone routers interconnect the groups – 100 GE CPU servers, 200 FE and 100 FE CPU servers, 64 and 36 disk servers, and 20 tape servers – over 1 GB, 3 GB and 8 GB lines.]
HEP Computing
• Mass Storage model
  – data resides on tape, cached on disk (see the sketch below)
  – light-weight private software for scalability, reliability, performance
  – petabyte-scale object persistency database products
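A minimal sketch of the tape-resident, disk-cached access pattern described above; the class and its methods are hypothetical illustrations, not any particular mass storage system's API.

    # Illustrative read-through disk cache in front of a tape store.
    # Names are invented; real mass storage systems differ in detail.
    class MassStore:
        def __init__(self, tape_recall):
            self.tape_recall = tape_recall   # function: filename -> bytes (slow)
            self.disk_cache = {}             # filename -> bytes (fast)

        def read(self, filename):
            # serve from disk if cached, otherwise recall from tape and cache it
            if filename not in self.disk_cache:
                self.disk_cache[filename] = self.tape_recall(filename)
            return self.disk_cache[filename]

    store = MassStore(tape_recall=lambda name: b"...")   # stand-in for a tape read
    store.read("run1234.raw")   # first read: recalled from tape, cached on disk
    store.read("run1234.raw")   # second read: served from the disk cache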
Mass Storage
Mass Storage - Industry
• OpenLab – StorageTek 9940B drives driven by CERN at 1.1 GB/s
• Tape only for backup
• Main data stored on disks
• Google example
Disk Storage
Disks – Commercial trends
• Jobs accessing files over the GRID
  – GRID copied files to a sandbox
  – new proposal for file access from the GRID
• OpenLab – IBM 28 TB TotalStorage using iSCSI disks
• iSCSI: SCSI over the Internet
• OSD: Object Storage Device = object-based SCSI
• Replication gives security and performance
File Access via Grid
• Access now takes place in steps (sketched below):
  1) find the site where the file resides using the replica catalogue
  2) check whether the file is on tape or on disk; if only on tape, move it to disk
  3) if you cannot open a remote file, copy the file to the worker node and use local I/O
  4) open the file
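A minimal sketch of those four steps, assuming hypothetical stand-ins (lookup_replicas, is_on_disk, stage_to_disk, open_remote, copy_to_worker_node) in place of the real replica-catalogue and data-transfer tools:

    # Hypothetical stand-ins for the grid middleware; a real setup would call the
    # replica catalogue and data-transfer tools instead of these stubs.
    def lookup_replicas(lfn):          return [("cern.ch", f"/data/{lfn}")]
    def is_on_disk(site, pfn):         return False
    def stage_to_disk(site, pfn):      pass                    # tape -> disk recall
    def open_remote(site, pfn):        raise IOError("no remote I/O available")
    def copy_to_worker_node(site, pfn):
        local = "/tmp/" + pfn.split("/")[-1]
        open(local, "wb").close()                              # fake local copy
        return local

    def open_grid_file(lfn):
        # 1) find the site where the file resides using the replica catalogue
        site, pfn = lookup_replicas(lfn)[0]
        # 2) if the file is only on tape, move (stage) it to disk first
        if not is_on_disk(site, pfn):
            stage_to_disk(site, pfn)
        # 3) open remotely if possible, otherwise copy to the worker node
        try:
            return open_remote(site, pfn)
        except IOError:
            local_path = copy_to_worker_node(site, pfn)
            # 4) open the (now local) file with ordinary local I/O
            return open(local_path, "rb")

    f = open_grid_file("run1234.root")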
Object Storage Device
Big disk, slow I/O tricks
[Diagram: the disk split into hot data and cold data regions.]
• Sequential access is faster than random – always read from start to finish (see the sketch below)
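A small, purely illustrative sketch of the point being made: reading a file from start to finish keeps the disk streaming, while the same volume read at random offsets pays a seek per read. The file name and sizes are arbitrary, and for a small, cached file the difference will be masked by the OS page cache; the effect matters for genuinely disk-resident data.

    # Compare sequential vs. random reads of the same total volume (illustrative).
    import os, random, time

    path, block, blocks = "bigfile.dat", 1024 * 1024, 64    # 64 x 1 MB test file
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(os.urandom(block) * blocks)              # create the test file

    def read_sequential(f):
        f.seek(0)
        for _ in range(blocks):
            f.read(block)                                    # one pass, start to finish

    def read_random(f):
        size = os.path.getsize(path)
        for _ in range(blocks):
            f.seek(random.randrange(0, size - block))        # a seek before every read
            f.read(block)

    with open(path, "rb") as f:
        for reader in (read_sequential, read_random):
            t0 = time.time()
            reader(f)
            print(reader.__name__, round(time.time() - t0, 3), "s")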
Network trends
• OpenLab: 755 MB/s over 10 Gbps Ethernet
• CERN/Caltech land speed record holders (in the Guinness Book of Records)
  – CERN to Chicago: IPv6 single stream, 983 Mbps
  – Sunnyvale to Geneva: IPv4 multiple streams, 2.38 Gbps
• Network Address Translation, NAT
• IPv6: IP address depletion, efficient packet handling, authentication, security, etc.
Port Address Translation
• PAT – a form of dynamic NAT that maps multiple unregistered IP addresses to a single registered IP address by using different ports (illustrated below)
• Avoids the IPv4 problem of limited addresses
• Mapping can be done dynamically, so adding nodes is easier
• Therefore easier management of the farm fabric?
IPv6
• IPv4: 32-bit address space, assigned very unevenly
  – 67% for the USA
  – 6% for Japan
  – 2% for China
  – 0.14% for India
• IPv6: 128-bit address space (see the comparison below)
• No longer a need for Network Address Translation, NAT?
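The difference in scale is easy to quantify; a trivial check of the two address-space sizes quoted above:

    # Size of the IPv4 vs. IPv6 address spaces.
    ipv4 = 2 ** 32
    ipv6 = 2 ** 128
    print(f"IPv4: {ipv4:,} addresses")      # ~4.3 billion
    print(f"IPv6: {ipv6:.3e} addresses")    # ~3.4e38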
1. The Problem and Introduction
2. Data Challenge – DC04
3. Proposed Computing Fabric
4. Conclusions
Conclusions
• CMS faces an enormous challenge in computing
  – short-term data challenges
  – long-term developments within the commercial and scientific world
• The year 2007 is still four years away
  – enough for a completely new generation of computing technologies to appear
• New inventions may revolutionise computing
  – CMS depends on this progress to make our computing possible and affordable