CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 1/18 Monitoring of a distributed computing system: the AliEn Grid Alice Offline weekly meeting Thursday 3rd February 2005 Marco MEONI


Monitoring of a distributed computing system: the AliEn Grid

Alice Offline weekly meeting
Thursday 3rd February 2005

Marco MEONI

Content
• Document I’ve been working on since mid-December 2004
• ~100 pages so far
• Not too far from the final version
• Available on http://... (let me discuss the thesis first)

1. ALICE and AliEn
2. Grid Monitoring
3. MonALISA
4. MonALISA adaptations and extensions
5. PDC 2004 monitoring and results
6. Conclusions and Outlook

~35 pages
~65 pages


Section I

Grid Concepts and Monitoring

Grid, ALICE, AliEn
• Grid Computing overview
  “coordinated use of large sets of different, geographically distributed resources in order to allow high-performance computation”
• ALICE experiment and ALICE Off-line
• AliEn
  • PULL rather than PUSH architecture
  • the scheduling service does not need to know the status of all other resources in the system
  • robust and fault-tolerant system where resources can come and go at any point in time
  • possible to interface an entire foreign Grid as a large Computing and Storage Element (LCG)
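The PULL idea above can be illustrated with a minimal sketch. It is not AliEn code: `TaskQueue`, `ComputingElement`, the `requires` field and the capability sets are all hypothetical names chosen for the example. The point it shows is that the central queue keeps no record of site status; a site asks for work only when it has free slots, so sites can appear or disappear without the queue noticing.

```python
from collections import deque


class TaskQueue:
    """Central queue (hypothetical): holds waiting jobs, never tracks site status."""

    def __init__(self, jobs):
        self.waiting = deque(jobs)

    def request_job(self, site_capabilities):
        """Called BY a site. Returns the first waiting job the site can run, or None."""
        for job in list(self.waiting):
            # A job matches when everything it requires is offered by the site.
            if job["requires"] <= site_capabilities:
                self.waiting.remove(job)
                return job
        return None


class ComputingElement:
    """A site (hypothetical) that pulls work only while it has free slots."""

    def __init__(self, name, capabilities, free_slots):
        self.name = name
        self.capabilities = set(capabilities)
        self.free_slots = free_slots

    def pull(self, queue):
        started = []
        while self.free_slots > 0:
            job = queue.request_job(self.capabilities)
            if job is None:
                break  # nothing suitable is waiting; try again later
            self.free_slots -= 1
            started.append(job["id"])
        return started


if __name__ == "__main__":
    queue = TaskQueue([
        {"id": 1, "requires": {"AliRoot"}},
        {"id": 2, "requires": {"AliRoot", "ROOT"}},
        {"id": 3, "requires": {"AliRoot"}},
    ])
    small = ComputingElement("siteA", {"AliRoot"}, free_slots=1)
    large = ComputingElement("siteB", {"AliRoot", "ROOT"}, free_slots=2)
    print(small.pull(queue), large.pull(queue))
```

In a PUSH design the queue would instead need an up-to-date view of every site before dispatching, which is exactly the global state the slide says the scheduler can do without.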

Grid Monitoring
• GMA architecture
• R-GMA: an example of implementation
• Jini (Sun) provides the technical basis

[Diagram: GMA components. The Producer stores its location in the Registry, the Consumer looks up the location in the Registry, and data is then transferred from Producer to Consumer.]
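The three GMA roles in the diagram (Producer, Consumer, Registry) can be sketched in a few lines. This is an in-process illustration under stated assumptions, not R-GMA or MonALISA code: the class and method names (`store_location`, `lookup_location`, `transfer`) simply mirror the diagram labels, and a real system would exchange network addresses rather than object references.

```python
class Registry:
    """Directory service: knows WHERE data is, never holds the data itself."""

    def __init__(self):
        self._locations = {}

    def store_location(self, metric, producer):
        self._locations.setdefault(metric, []).append(producer)

    def lookup_location(self, metric):
        return list(self._locations.get(metric, []))


class Producer:
    """Holds monitored values and advertises each metric once in the registry."""

    def __init__(self, name, registry):
        self.name = name
        self._registry = registry
        self._data = {}

    def publish(self, metric, value):
        if metric not in self._data:
            self._registry.store_location(metric, self)
        self._data[metric] = value

    def transfer(self, metric):
        # Data goes directly from producer to consumer, bypassing the registry.
        return self._data[metric]


class Consumer:
    """Finds producers through the registry, then fetches data from them."""

    def __init__(self, registry):
        self._registry = registry

    def collect(self, metric):
        producers = self._registry.lookup_location(metric)
        return {p.name: p.transfer(metric) for p in producers}


if __name__ == "__main__":
    registry = Registry()
    Producer("siteA", registry).publish("cpu_load", 0.7)
    Producer("siteB", registry).publish("cpu_load", 0.2)
    print(Consumer(registry).collect("cpu_load"))
```

Keeping the registry free of data is the design choice that matters here: it stays small and the data path scales with the number of producer/consumer pairs.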


Section II

MonALISA Adaptations and Extensions

MonALISA Adaptations
• Farms monitoring
  • User Java class to interface MonALISA and bash script to monitor the site
• A Web Repository as a front-end
  • Stores history of the monitored data
  • Plots any kind of chart
  • Interfaces to user code (custom consumers, config modules, new charts, distributions)

[Diagram: on ALICE’s resources (CE, WNs), a Bash monitoring script feeds a Java interface class (user code), which passes the monitored data to the MonALISA Agent (MonALISA framework).]

Repository
• Additional Java thread to feed the repository directly

[Diagram: an ad hoc Java thread passes monitored data to the repository (Tomcat, JSP/servlets).]

AliEn Jobs Monitoring
• If the Grid executes jobs then it works!
• Centralized or distributed?
• AliEn native APIs to retrieve job status snapshots

[Diagram: job status flow starting from “Job is submitted”, with timeouts (>1h, >3h) and error states Error_I, Error_A, Error_S, Error_E, Error_R, Error_V/VT/VN, Error_SV.]

Repository DataBase(s)
• 7.5 GB of monitored information, 52M records
• During DCs, data from ~2K monitored parameters arrive every 2-3 minutes
• Data replication: MASTER DB → SPARE DB (online replication)
  • Hosts: alimonitor.cern.ch, aliweb01.cern.ch
  • Uses: data collecting and Grid monitoring; Grid analysis

[Diagram: averaging process. 1 min → 10 min → 100 min, 60 bins for each basic piece of information, FIFO.]
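One way to read the averaging diagram is as a cascade of fixed-size FIFO buffers: 60 bins at 1-minute resolution, 60 at 10 minutes and 60 at 100 minutes, where every ten values of one level are averaged into one bin of the next. The sketch below implements that reading. The class name `History` and the rule "average exactly 10 consecutive values" are assumptions made for illustration; the slide does not spell out how the repository computes the averages.

```python
from collections import deque


class History:
    """Multi-resolution history for one monitored parameter (illustrative)."""

    def __init__(self, bins=60, factor=10, levels=3):
        self.factor = factor
        # One FIFO per resolution; deque(maxlen=...) drops the oldest bin when full.
        self.levels = [deque(maxlen=bins) for _ in range(levels)]
        # Values waiting to be averaged into the next, coarser level.
        self._pending = [[] for _ in range(levels - 1)]

    def add(self, value, level=0):
        self.levels[level].append(value)
        if level + 1 < len(self.levels):
            pending = self._pending[level]
            pending.append(value)
            if len(pending) == self.factor:
                mean = sum(pending) / self.factor
                pending.clear()
                self.add(mean, level + 1)  # cascade into the coarser level


if __name__ == "__main__":
    h = History()
    for minute in range(100):
        h.add(float(minute))
    print(len(h.levels[0]), len(h.levels[1]), len(h.levels[2]))
```

With this layout the storage per parameter stays constant (at most 180 bins) while still covering 1 hour, 10 hours and 100 hours of history at decreasing resolution.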

Monitored parameters

Source             Category               Number  Examples
AliEn API          CE load factors            63  Run load, queue load
                   SE occupancy               62  Used space, free space, number of files
                   Job information           557  Running, saving, done, failed
Soap calls         CERN network traffic       29  MBs, files
LCG                CPU – Jobs                 48  Free CPUs, jobs running and waiting
ML services on MQ  Job summary                34  Running, saving, done, failed
                   AliEn parameters           15  MySQL load, Perl processes
ML services        Sites info               1060  Paging, threads, I/O, processes
Total                                       1868

Derived classes…

Job execution efficiency  =  Successfully done jobs / all submitted jobs
System efficiency         =  Error (CE) free jobs / all submitted jobs
AliRoot efficiency        =  Error (AliROOT) free jobs / all submitted jobs
Resource efficiency       =  Running (queued) jobs / max_running (queued)
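The four derived efficiencies are plain ratios of job counters, which a short sketch makes concrete. The function and argument names are hypothetical, the input values in the usage example are invented purely for illustration (they are not PDC 2004 results), and "error-free jobs" is taken here to mean submitted jobs minus the jobs that failed with that class of error, which is an assumption about how the counters are defined.

```python
def ratio(numerator, denominator):
    """Safe division: returns 0.0 when nothing has been submitted yet."""
    return numerator / denominator if denominator else 0.0


def derived_efficiencies(submitted, done, ce_errors, aliroot_errors,
                         running, max_running):
    """Compute the four derived classes from raw job counters."""
    return {
        # Successfully done jobs / all submitted jobs
        "job_execution": ratio(done, submitted),
        # Error (CE) free jobs / all submitted jobs
        "system": ratio(submitted - ce_errors, submitted),
        # Error (AliROOT) free jobs / all submitted jobs
        "aliroot": ratio(submitted - aliroot_errors, submitted),
        # Running jobs / max_running (the same formula applies to queued jobs)
        "resource": ratio(running, max_running),
    }


if __name__ == "__main__":
    # Illustrative numbers only.
    for name, value in derived_efficiencies(
            submitted=200, done=150, ce_errors=20,
            aliroot_errors=10, running=30, max_running=40).items():
        print(f"{name}: {value:.2f}")
```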

Extensions
• Job monitoring by user
• Repository Web Services
• Application Monitoring (ApMon) at WNs
• Grid Analysis

• AliEn “ps –xxx” commands
• Job’s JDL
• Results presented in the same web front end

• Repository interfaced to ROOT and Carrot


Section III

PDC 2004 Monitoring and Results

Phase 1 (simulation)
• Start 10/03, end 29/05 (58 days active)
• Maximum jobs running in parallel: 1450
• Average during active period: 430

[Plot: sum of all sites]
[Plots: successfully done jobs / all submitted jobs; error (CE) free jobs / all submitted jobs; error (AliROOT) free jobs / all submitted jobs]

Phase 2 (merging)
• As in the 1st phase, general equilibrium in CPU contribution: no single site dominating the production
• Jobs successfully done: 76% AliEn, 24% LCG

Jobs failure        Reason                                             Rate
Submission          CE scheduler not responding                          1%
Loading input data  Remote SE not responding                             3%
During execution    Job aborted, not started, killed, WN malfunction    10%
Saving output data  Local SE not responding                              2%

Phase 3 (analysis)

• Occupancy changes with respect to the number of queued jobs in the local batch system


Salutations…

Credits
• Federico, Predrag and Peter: they could pick up another TS
• Latchezar: continuous help and suggestions, review of my thesis
• MonALISA team: collaborative anytime I needed
• Guenter: very useful integrations
• my fiancée: moral support: “did they hire you just to look at some plots?”


…thanks to all

…and all the others I couldn’t find a pic!