February 2006 Iosif Legrand1
Iosif Legrand
California Institute of Technology
February 2006
An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed Systems
The MonALISA Framework
MonALISA is a Dynamic, Distributed Service System capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and global optimization of workflows in complex grid systems.
The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications.
MonALISA is a Dynamic, Distributed Service Architecture
The framework is based on a hierarchical structure of loosely coupled agents acting as distributed services, which are independent and autonomous entities able to discover one another and to cooperate using a dynamic set of proxies or self-describing protocols.
An agent-based architecture provides the ability to invest the system with increasing degrees of intelligence, to reduce complexity, and to make global systems manageable in real time. For an effective use of distributed resources, these services provide adaptability and self-organization.
MonALISA service & Data Handling
[Diagram: the MonALISA Service. Labels: Lookup Services (Discovery, Registration); Data Cache Service & DB (Postgres, MySQL); Data Stores; Configuration Control (SSL); Web Service (WSDL, SOAP); Java clients (other services); Web client; Applications; user-defined loadable modules to write/send data; Predicates & Agents; communications via the ML Proxy.]
The MonALISA Discovery System & Services
Network of JINI-LUSs, Secure & Public
MonALISA services
Proxies
Clients, HL services, repositories
Distributed Dynamic Discovery, based on a lease mechanism and REN
Distributed system for gathering and analyzing information
Dynamic load balancing, scalability & replication
Security: AAA for clients
Global Services or Clients
Fully Distributed System with no Single Point of Failure
AGENTS
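The lease mechanism behind the dynamic discovery can be illustrated with a small sketch. This is not the Jini lookup service API or MonALISA code: the class `LeaseRegistry`, its methods and the service names are hypothetical, and the only idea carried over from the slide is that a registration stays visible only while its lease keeps being renewed.

```python
import time
from typing import Callable, Dict, List


class LeaseRegistry:
    """Toy lookup service (hypothetical): entries expire unless renewed."""

    def __init__(self, lease_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lease_seconds = lease_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._expiry: Dict[str, float] = {}

    def register(self, service_name: str) -> float:
        """Grant (or re-grant) a lease and return its expiry time."""
        expires_at = self.clock() + self.lease_seconds
        self._expiry[service_name] = expires_at
        return expires_at

    def active_services(self) -> List[str]:
        """Drop expired leases, then list the services still registered."""
        now = self.clock()
        self._expiry = {name: t for name, t in self._expiry.items() if t > now}
        return sorted(self._expiry)

    def renew(self, service_name: str) -> bool:
        """Extend a lease; fails if the lease has already expired."""
        if service_name not in self.active_services():
            return False
        self.register(service_name)
        return True
```

A service that crashes or loses connectivity simply stops renewing, so it disappears from the lookup results without any explicit deregistration step.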
Monitoring Internet2 backbone Network
Test for a Land Speed Record: ~7 Gb/s in a single TCP stream from Geneva to Caltech
The UltraLight Network
BNL ESnet IN /OUT
Monitoring Network Topology: Latency, Routers
NETWORKS
AS
ROUTERS
Monitoring The GLORIAD Ring
Monitoring Grid sites, Running Jobs, Network Traffic, and Connectivity
TOPOLOGY
JOBS
ACCOUNTING
Monitoring OSG: Resources, Jobs & Accounting
42 sites, ~4,000 nodes (10,000 CPUs), thousands of jobs, 60,000 parameters
Running Jobs Accounting
FTP Data Transfer between GRID sites
Total FTP Traffic per VO
Bandwidth Challenge at SC2005
151 Gb/s
~ 500 TB Total in 4h
End User / Client Agent
LISA - Localhost Information Service Agent
Authorization
Service discovery
Local detection of the hardware and software configuration
Complete end-system monitoring: per-process load, I/O and network throughputs, etc.
End-to-end performance measurements
Will act as an active listener for all events related to the requests generated by its local applications.
Host Monitoring at SC2005
Many “network” problems are actually endhost problems: misconfigured or underpowered end-systems.
The LISA application was designed to monitor the endhost and its view of the network.
For SC|05 we used LISA to gather the relevant host details related to network performance.
System information, TCP configuration and network device setup were gathered and made accessible from one site.
Future plans are to coordinate this with LISA and deploy this as part of OSG. The Tier-2 centers are a primary target.
[Panels: Network Device Information; TCP Settings; Host/System Information]
Available Bandwidth Measurements
Embedded Pathload module.
Coordination Service for Available Bandwidth Measurements
Enforces measurement fairness
Avoids multiple probes on shared network segments
Dynamic configuration of measurement timing
Logs events
Provides service redundancy by using a master-slave model
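The rule of never running two probes over a shared network segment can be sketched as follows. This is an illustrative toy, not the MonALISA coordination service: `ProbeCoordinator`, the probe identifiers and the segment names are all hypothetical, waiting probes are simply retried oldest-first, and the master-slave redundancy mentioned above is not modelled.

```python
from collections import deque
from typing import Deque, Dict, FrozenSet, Iterable, List, Set, Tuple


class ProbeCoordinator:
    """Hypothetical coordinator: one active probe per network segment."""

    def __init__(self) -> None:
        self._active: Dict[str, FrozenSet[str]] = {}
        self._waiting: Deque[Tuple[str, FrozenSet[str]]] = deque()
        self.log: List[str] = []  # event log, as on the slide

    def _busy_segments(self) -> Set[str]:
        busy: Set[str] = set()
        for segments in self._active.values():
            busy |= segments
        return busy

    def request(self, probe_id: str, segments: Iterable[str]) -> bool:
        """Start the probe now if its path is free, otherwise queue it."""
        path = frozenset(segments)
        if path & self._busy_segments():
            self._waiting.append((probe_id, path))
            self.log.append(f"queued {probe_id}")
            return False
        self._active[probe_id] = path
        self.log.append(f"started {probe_id}")
        return True

    def finish(self, probe_id: str) -> List[str]:
        """Release a probe's segments and start waiting probes, oldest first."""
        self._active.pop(probe_id)
        self.log.append(f"finished {probe_id}")
        started: List[str] = []
        still_waiting: Deque[Tuple[str, FrozenSet[str]]] = deque()
        while self._waiting:
            waiting_id, path = self._waiting.popleft()
            if path & self._busy_segments():
                still_waiting.append((waiting_id, path))
            else:
                self._active[waiting_id] = path
                self.log.append(f"started {waiting_id}")
                started.append(waiting_id)
        self._waiting = still_waiting
        return started
```

Two probes whose paths share no segment run at the same time; a probe that overlaps an active one waits until that one calls `finish`.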
Monitoring the Execution of Jobs and the Time Evolution
SPLIT JOBS
LIFELINES for JOBS
[Diagram labels: Submit a Job; DAG; Job; Job1; Job2; Job3; Job31; Job32]
ApMon – Application Monitoring
[Diagram: two APPLICATIONs, each with ApMon, send Monitoring Data over UDP/XDR to MonALISA Services. Labels: Time; IP; procID. App. Monitoring examples: Mbps_out: 0.52, Status: reading, MB_inout: 562.4, parameter1: value, parameter2: value. System Monitoring examples: load1: 0.24, processes: 97, pages_in: 83. ApMonConfig (MonALISA hosts) supplied by a Config Servlet.]
Library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA services
Flexibility, dynamic configuration, high communication performance, dynamic reloading
ApMon configuration generated automatically by a servlet / CGI script
Automated system monitoring
Accounting information
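Sending named parameters as XDR-encoded UDP datagrams can be sketched with the Python standard library alone. This is not the ApMon library and not its actual wire format: the datagram layout (cluster, node, parameter count, then name/value pairs), the function names and the cluster/node names are assumptions made for illustration. Only the XDR primitives themselves (big-endian values, strings padded to a multiple of 4 bytes) follow the XDR standard.

```python
import socket
import struct
from typing import Dict, Tuple


def xdr_string(text: str) -> bytes:
    """XDR string: 4-byte big-endian length, bytes, zero-padded to 4 bytes."""
    raw = text.encode("utf-8")
    padding = (4 - len(raw) % 4) % 4
    return struct.pack(">I", len(raw)) + raw + b"\x00" * padding


def encode_datagram(cluster: str, node: str, params: Dict[str, float]) -> bytes:
    """Assumed layout: cluster, node, count, then (name, double) pairs."""
    parts = [xdr_string(cluster), xdr_string(node), struct.pack(">i", len(params))]
    for name, value in params.items():
        parts.append(xdr_string(name))
        parts.append(struct.pack(">d", float(value)))  # XDR double
    return b"".join(parts)


def decode_datagram(data: bytes) -> Tuple[str, str, Dict[str, float]]:
    """Inverse of encode_datagram, as a receiving service might do."""
    offset = 0

    def read_string() -> str:
        nonlocal offset
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        text = data[offset:offset + length].decode("utf-8")
        offset += length + (4 - length % 4) % 4
        return text

    cluster = read_string()
    node = read_string()
    (count,) = struct.unpack_from(">i", data, offset)
    offset += 4
    params: Dict[str, float] = {}
    for _ in range(count):
        name = read_string()
        (value,) = struct.unpack_from(">d", data, offset)
        offset += 8
        params[name] = value
    return cluster, node, params


def send_parameters(host: str, port: int, cluster: str, node: str,
                    params: Dict[str, float]) -> int:
    """Fire-and-forget UDP send; returns the number of bytes in the datagram."""
    payload = encode_datagram(cluster, node, params)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return len(payload)
```

UDP keeps the sender non-blocking, so a slow or unreachable monitoring service does not stall the application being monitored; the price is that individual datagrams may be lost.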
[Plot: MonALISA CPU Usage (%), 0 to 70, versus Messages per second, 0 to 6000. Annotation: No Lost Packages]
Optical Switch
Runs a ML Daemon
>ml_path IP1 IP4 “copy file IP4”
ML proxy services used in Agent Communication
ML Daemon: Control and Monitor the switch
MonALISA agents create, on demand, an optical path or tree
Time to create a path on demand: <1 s, independent of the location and the number of connections
[Diagram: three Optical Switches, each with a MonALISA ML Agent; numbered steps 1 to 4; Discovery & Secure Connection]
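One way to picture on-demand path creation is as a graph search followed by one cross-connect per switch along the route. The sketch below is an assumption-laden illustration, not the algorithm the MonALISA agents use: the topology, the switch names and the functions `find_path` and `plan_cross_connects` are hypothetical, and a plain breadth-first search stands in for whatever path selection the agents actually perform.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Topology = Dict[str, List[str]]  # node name -> directly connected nodes


def find_path(topology: Topology, source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search: returns a path with the fewest hops, or None."""
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in topology.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def plan_cross_connects(path: List[str]) -> List[Tuple[str, str, str]]:
    """For each intermediate switch: (switch, input side, output side)."""
    return [(path[i], path[i - 1], path[i + 1]) for i in range(1, len(path) - 1)]


# Hypothetical topology: two end hosts joined through three optical switches.
EXAMPLE_TOPOLOGY: Topology = {
    "IP1": ["switch1"],
    "switch1": ["IP1", "switch2", "switch3"],
    "switch2": ["switch1", "switch3"],
    "switch3": ["switch1", "switch2", "IP4"],
    "IP4": ["switch3"],
}
```

In this picture each tuple returned by `plan_cross_connects` would be handed to the agent responsible for that switch, which then configures the corresponding port mapping.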
Monitoring and Controlling Optical Planes
Port power monitoring
Controlling
Monitoring Optical Switches
Agents to Create on Demand an Optical Path
Communities using MonALISA
Major Communities: OSG, CMS, ALICE, D0, STAR, VRVS, LGC RUSSIA, SE Europe GRID, APAC Grid, UNAM Grid
Networks: ABILENE, ULTRALIGHT, GLORIAD, LHC Net, RoEduNET
[Panels: ABILENE; VRVS; ALICE; CMS-DC04]
Demonstrated at:
SC2003
Telecom World 2003
WSIS 2003
SC 2004
I2 2005
TERENA 2005
IGrid 2005
SC 2005
MonALISA running 24 x 7 at 250 sites
Collecting 250,000 parameters in near real-time
Update rate of 25,000 parameter updates per second
Monitoring 12,000 computers and > 100 WAN links
Thousands of Grid jobs running concurrently
The MonALISA Architecture Provides:
Distributed Registration and Discovery for Services and Applications.
Monitoring all aspects of complex systems:
    System information for computer nodes and clusters
    Network information: WAN and LAN
    Monitoring the performance of Applications, Jobs or services
    End-user systems and their performance
    Video streaming
Can interact with any other services to provide, in near real-time, customized information based on monitoring data
Secure, remote administration for services and applications
Agents to supervise applications, trigger alarms, restart or reconfigure them, and notify other services when certain conditions are detected.
The MonALISA framework is used to develop higher level decision services, implemented as a distributed network of communicating agents, to perform global optimization tasks.
Graphical User Interfaces to visualize complex information
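The idea of agents that supervise applications and react when certain conditions are detected can be sketched as a small rule engine. Nothing here is MonALISA code: `Rule`, `SupervisorAgent`, the thresholds and the action strings are hypothetical; the parameter names `load1` and `processes` are borrowed from the ApMon slide purely as sample inputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Sample = Dict[str, float]  # one set of monitored values


@dataclass
class Rule:
    """Hypothetical rule: when `condition` holds, run `action`."""
    name: str
    condition: Callable[[Sample], bool]
    action: Callable[[Sample], str]  # returns a short description of what was done


class SupervisorAgent:
    """Evaluates every rule against each sample and records notifications."""

    def __init__(self, rules: Iterable[Rule]) -> None:
        self.rules = list(rules)
        self.notifications: List[str] = []

    def observe(self, sample: Sample) -> List[str]:
        """Return the names of the rules that fired for this sample."""
        fired: List[str] = []
        for rule in self.rules:
            if rule.condition(sample):
                self.notifications.append(f"{rule.name}: {rule.action(sample)}")
                fired.append(rule.name)
        return fired


# Example rules with made-up thresholds; real actions would raise an alarm
# or restart the application instead of returning a string.
EXAMPLE_RULES = [
    Rule("high-load", lambda s: s.get("load1", 0.0) > 8.0, lambda s: "alarm raised"),
    Rule("app-down", lambda s: s.get("processes", 1.0) == 0, lambda s: "restart requested"),
]
```

Keeping conditions and actions as plain callables means new supervision rules can be added without changing the agent itself.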