CERN – Alice Offline – Thu, 03 Feb 2005 – Marco MEONI - 1/18
Monitoring of a distributed computing system:
the AliEn Grid
Alice Offline weekly meeting
Thursday 3rd February 2005
Marco MEONI
Content
• Document I’ve been working on since mid Dec 2004
• ~100 pages up to now
• Not too far from the final version
• Available on http://... (let me discuss the thesis first)
1. ALICE and AliEn
2. Grid Monitoring
3. MonALISA
4. MonALISA adaptations and extensions
5. PDC 2004 monitoring and results
6. Conclusion and Outlooks
Sizes: ~35 pages, ~65 pages
Grid, ALICE, AliEn
• Grid Computing overview
“coordinated use of large sets of different, geographically distributed resources in order to allow high-performance computation”
• ALICE experiment and ALICE Off-line
• AliEn
  • PULL rather than PUSH architecture
  • scheduling service does not need to know the status of all other resources in the system
  • robust and fault tolerant system where resources can come and go at any point in time
  • possible to interface an entire foreign Grid as a large Computing and Storage Element (LCG)
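The PULL idea can be illustrated with a minimal Python sketch. It is not AliEn code: the `TaskQueue` class, the `Job` type, the capability names and the site names are all hypothetical, chosen only to show that the central queue never tracks site status and that a site simply asks for work when it has a free slot.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    requirements: set = field(default_factory=set)  # hypothetical capability tags


class TaskQueue:
    """Central queue: holds waiting jobs and knows nothing about site status."""

    def __init__(self):
        self._waiting = deque()

    def submit(self, job):
        self._waiting.append(job)

    def pull(self, capabilities):
        """Called by a site with a free slot; returns a matching job or None."""
        for job in list(self._waiting):
            if job.requirements <= capabilities:
                self._waiting.remove(job)
                return job
        return None


queue = TaskQueue()
queue.submit(Job(1, {"aliroot"}))
queue.submit(Job(2, {"aliroot", "large_disk"}))
queue.submit(Job(3))  # no special requirements

# Sites advertise what they offer only at the moment they ask for work.
site_a = {"aliroot"}
site_b = {"aliroot", "large_disk"}

assigned = [
    ("site_a", queue.pull(site_a).job_id),
    ("site_b", queue.pull(site_b).job_id),
    ("site_a", queue.pull(site_a).job_id),
]
leftover = queue.pull(site_a)  # queue is empty now, so nothing is returned
```

A site that disappears just stops calling `pull`, so the queue needs no cleanup, which is one way to read the "resources can come and go" bullet.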
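Grid Monitoring
• GMA architecture
• R-GMA: an example of implementation
• Jini (Sun) provides the technical basis
[Diagram: Producer, Consumer and Registry. The Producer stores its location in the Registry, the Consumer looks the location up there, and data is then transferred from Producer to Consumer.]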
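The three GMA roles in the diagram can be sketched in a few lines of Python. This is only an illustration of the store-location / lookup-location / transfer-data pattern; the class and method names, the metric name `load` and the site names are assumptions and do not come from R-GMA, Jini or MonALISA.

```python
class Registry:
    """Directory that maps a metric name to the producers that offer it."""

    def __init__(self):
        self._locations = {}

    def store_location(self, metric, producer):
        self._locations.setdefault(metric, []).append(producer)

    def lookup_location(self, metric):
        return list(self._locations.get(metric, []))


class Producer:
    def __init__(self, name, registry):
        self.name = name
        self._registry = registry
        self._values = {}

    def publish(self, metric, value):
        # Register once per metric; the registry holds locations, never data.
        if metric not in self._values:
            self._registry.store_location(metric, self)
        self._values[metric] = value

    def transfer(self, metric):
        return self._values[metric]


class Consumer:
    def __init__(self, registry):
        self._registry = registry

    def read(self, metric):
        # Look up who has the metric, then fetch it from each producer directly.
        producers = self._registry.lookup_location(metric)
        return {p.name: p.transfer(metric) for p in producers}


registry = Registry()
Producer("site_1", registry).publish("load", 0.5)
Producer("site_2", registry).publish("load", 0.9)
readings = Consumer(registry).read("load")
```

The registry stays small because it only stores where data lives; the measurements themselves travel directly between producer and consumer.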
Section II
MonALISA Adaptations and Extensions
MonALISA Adaptations
• Farms monitoring
  • User Java class to interface MonALISA and bash script to monitor the site
• A WEB Repository as a front-end
  • Stores history of the monitored data
  • Plots any kind of chart
  • Interfaces to user code (custom consumers, config modules, new charts, distributions)
[Diagram: ALICE's resources (CE, WNs), user code (Bash monitoring script, Java interface class) and the MonALISA framework (MonALISA Agent), linked by the monitored data.]
Repository
• Additional Java thread to feed the repository directly
[Diagram: an ad hoc Java thread passes monitored data to the repository, served by TOMCAT with JSP/servlets.]
AliEn Jobs Monitoring
• If the Grid executes jobs then it works!
• Centralized or distributed?
• AliEn native APIs to retrieve job status snapshots
[Diagram: job status flow starting at "Job is submitted", with >1h and >3h timeouts and the error states Error_I, Error_A, Error_S, Error_E, Error_R, Error_V/VT/VN and Error_SV.]
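A job status snapshot can be reduced to per-status counts before it is stored. The sketch below assumes a snapshot is a plain list of `(job_id, status)` pairs; that format, the job IDs and the function name are hypothetical, and no AliEn API call is shown. Only the `Error_` prefix follows the state names on the slide.

```python
from collections import Counter


def summarize_snapshot(snapshot):
    """Count jobs per status and total the jobs in any Error_* state."""
    counts = Counter(status for _job_id, status in snapshot)
    errors = sum(n for status, n in counts.items() if status.startswith("Error_"))
    return counts, errors


# Made-up snapshot used only to exercise the function.
snapshot = [
    (101, "Running"),
    (102, "Done"),
    (103, "Error_E"),
    (104, "Running"),
    (105, "Error_V"),
    (106, "Saving"),
]
counts, errors = summarize_snapshot(snapshot)
```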
Repository DataBase(s)
• 7.5 GB of monitored information, 52M records
• During Data Challenges (DCs), data from ~2K monitored parameters arrive every 2-3 minutes
• Data replication: MASTER DB to SPARE DB (online replication)
  Hosts: alimonitor.cern.ch, aliweb01.cern.ch
[Diagram: data collecting feeds both Grid Monitoring and Grid Analysis. An averaging process fills FIFO histories at 1 min, 10 min and 100 min resolution, with 60 bins for each basic piece of information.]
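One way to read the averaging scheme is as three fixed-size FIFO histories, where every 10 values at one resolution are averaged into a single bin of the next. The Python sketch below follows that reading; the class name and the cascade logic are assumptions, not the repository's actual implementation.

```python
from collections import deque


class RollingHistory:
    """Three FIFO histories (1, 10 and 100 min bins), 60 bins each."""

    def __init__(self, bins=60, factor=10, levels=3):
        self.factor = factor
        # deque(maxlen=bins) drops the oldest bin automatically (FIFO).
        self.levels = [deque(maxlen=bins) for _ in range(levels)]
        self._pending = [[] for _ in range(levels - 1)]

    def add(self, value, level=0):
        self.levels[level].append(value)
        if level + 1 < len(self.levels):
            pending = self._pending[level]
            pending.append(value)
            if len(pending) == self.factor:
                # Ten bins at this level collapse into one bin at the next.
                average = sum(pending) / self.factor
                pending.clear()
                self.add(average, level + 1)


history = RollingHistory()
for minute in range(200):
    history.add(float(minute))  # one synthetic reading per minute
```

With 60 bins per level, the three histories cover roughly the last hour, the last 10 hours and the last 100 hours while storing only 180 values per parameter.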
Monitored parameters

Source              Category               Number   Examples
AliEn API           CE load factors            63   Run load, queue load
                    SE occupancy               62   Used space, free space, number of files
                    Job information           557   Running, saving, done, failed
Soap calls          CERN Network traffic       29   MBs, files
LCG                 CPU – Jobs                 48   Free CPUs, jobs running and waiting
ML services on MQ   Job summary                34   Running, saving, done, failed
                    AliEn parameters           15   MySQL load, Perl processes
ML services         Sites info               1060   Paging, threads, I/O, processes
Total                                        1868

Derived classes…

Job execution efficiency   Successfully done jobs / all submitted jobs
System efficiency          Error (CE) free jobs / all submitted jobs
AliRoot efficiency         Error (AliROOT) free jobs / all submitted jobs
Resource efficiency        Running (queued) jobs / max_running (queued)
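The four derived ratios can be computed from raw job counters as in the Python sketch below. The function and argument names are hypothetical, "error free" is assumed to mean submitted jobs minus jobs with that kind of error, and the numbers in the usage lines are invented for illustration, not PDC 2004 results.

```python
def ratio(numerator, denominator):
    """Safe division: an empty denominator yields 0.0 instead of an error."""
    return numerator / denominator if denominator else 0.0


def derived_efficiencies(submitted, done, ce_errors, aliroot_errors,
                         running, max_running):
    return {
        "job_execution": ratio(done, submitted),
        # Assumption: error free jobs = submitted jobs minus jobs with that error.
        "system": ratio(submitted - ce_errors, submitted),
        "aliroot": ratio(submitted - aliroot_errors, submitted),
        "resource": ratio(running, max_running),
    }


# Invented counters, only to show the call.
eff = derived_efficiencies(submitted=1000, done=800, ce_errors=50,
                           aliroot_errors=100, running=300, max_running=400)
```

The same `ratio` call applies to the queued variant of resource efficiency, using queued jobs over the maximum queued.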
Extensions
• Job monitoring by user
  • AliEn “ps –xxx” commands
  • Job’s JDL
  • Results presented in the same web front end
• Repository Web Services
• Application Monitoring (ApMon) at WNs
• Grid Analysis
• Repository interfaced to ROOT and Carrot
Section III
PDC 2004 Monitoring and Results
Phase 1 (simulation)
• Start 10/03, end 29/05 (58 days active)
• Maximum jobs running in parallel: 1450
• Average during active period: 430
[Plots: sum of all sites; successfully done jobs / all submitted jobs; error (CE) free jobs / all submitted jobs; error (AliROOT) free jobs / all submitted jobs.]
Phase 2 (merging)
• As in the 1st phase, general equilibrium in CPU contribution: no single site dominates the production
• Jobs successfully done: 76% AliEn, 24% LCG
Jobs failure          Reason                                             Rate
Submission            CE scheduler not responding                          1%
Loading input data    Remote SE not responding                             3%
During execution      Job aborted, not started, killed, WN malfunction    10%
Saving output data    Local SE not responding                              2%
Phase 3 (analysis)
• Occupancy changes with respect to the number of queued jobs in the local batch system
Credits
• Federico, Predrag and Peter: they could pick up another TS
• Latchezar: continuous help and suggestions, review of my thesis
• MonALISA team: collaborative anytime I needed
• Guenter: very useful integrations
• my fiancee: moral support: “did they hire you just to look at some plots?”