monitoring system basics, and adding support for a new...

44
Mitglied der Helmholtz-Gemeinschaft Monitoring system basics, and adding support for a new batch system September 19, 2012 Carsten Karbach and Wolfgang Frings

Upload: others

Post on 16-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Monitoring system basics,and adding support fora new batch system

September 19, 2012 Carsten Karbach and Wolfgang Frings

Page 2: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Content

1 PTP System Monitoring2 Architecture3 Large-scale system Markup Language (LML)4 Implementation5 Scalability6 Add new target system7 Conclusion

September 19, 2012 Carsten Karbach and Wolfgang Frings 2 40

Page 3: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Part I: PTP System Monitoring

September 19, 2012 Carsten Karbach and Wolfgang Frings

Page 4: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

1. PTP System Monitoring

PTP Monitoring Scope

job and system monitoring of large-scale supercomputersmonitoring of multiple target systems in one perspectivesupport for many batch systems (Grid Engine,LoadLeveler, Open MPI, PBS, Slurm, Torque)overview of the system on a single screenbased on monitoring application LLview

September 19, 2012 Carsten Karbach and Wolfgang Frings 4 40

Page 5: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

1. PTP System Monitoring

PTP Monitoring Perspective

September 19, 2012 Carsten Karbach and Wolfgang Frings 5 40

Page 6: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

1. PTP System Monitoring

PTP Monitoring Perspective

September 19, 2012 Carsten Karbach and Wolfgang Frings 5 40

Page 7: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

1. PTP System Monitoring

PTP Monitoring Perspective

September 19, 2012 Carsten Karbach and Wolfgang Frings 5 40

Page 8: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

1. PTP System Monitoring

PTP Monitoring Perspective

September 19, 2012 Carsten Karbach and Wolfgang Frings 5 40

Page 9: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

1. PTP System Monitoring

PTP Monitoring Perspective

September 19, 2012 Carsten Karbach and Wolfgang Frings 5 40

Page 10: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

1. PTP System Monitoring

Monitoring Views

Nodes View renders target system architecture,maps jobs to compute resourcesActive Jobs View lists running jobsInactive Jobs View lists queued jobsMonitoring View selects active target system,starts/stops monitoringMessage View shows message of the day

September 19, 2012 Carsten Karbach and Wolfgang Frings 6 40

Page 11: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Part II: Architecture

September 19, 2012 Carsten Karbach and Wolfgang Frings

Page 12: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

2. Architecture

Monitoring Architecture I

September 19, 2012 Carsten Karbach and Wolfgang Frings 8 40

Page 13: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

2. Architecture

Monitoring Architecture II

LML da gathers status information,calls target system’s remote commands, written in PerlLML is a data format for status informationof supercomputersLML request: contains table filtering information,visible/hidden columnsLML response: contains the request and status informationclient stores current layout requestfor successive Eclipse sessions

September 19, 2012 Carsten Karbach and Wolfgang Frings 9 40

Page 14: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

2. Architecture

Communications Protocol

September 19, 2012 Carsten Karbach and Wolfgang Frings 10 40

Page 15: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Part III: Large-scale system MarkupLanguage (LML)

September 19, 2012 Carsten Karbach and Wolfgang Frings

Page 16: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

3. Large-scale system Markup Language (LML)

Large-scale system Markup Language (LML)

markup language for description of supercomputer’s statusinterface between LML da and visualization clientsdescribes all graphical components available in LLview(job-list, node display, charts ...)logical and system independent descriptionof current statusclose to graphical representation→ thin visualization clientsimplemented in XML, validation against XML-Schema

September 19, 2012 Carsten Karbach and Wolfgang Frings 12 40

Page 17: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

3. Large-scale system Markup Language (LML)

LML Structure

Global data: intermediate data format, scheduling objectsGraphical components:data for visualization componentsLayout: hints for visualization, node display hierarchy

September 19, 2012 Carsten Karbach and Wolfgang Frings 13 40

Page 18: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

3. Large-scale system Markup Language (LML)

Example node display

LML-description of a node display

<nodedisplay title="Cluster" id="1">

<scheme >

<el1 tagname="Node" min="1" max="4">

<el2 tagname="CPU" min="1" max="3"/>

</el1>

</scheme >

<data>

<el1 min="1" max="2" oid="job1">

<el2 min="3" oid="job2"/>

</el1>

<el1 min="3" max="4" oid="job2"/>

</data>

</nodedisplay >

September 19, 2012 Carsten Karbach and Wolfgang Frings 14 40

Page 19: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Part IV: Implementation

September 19, 2012 Carsten Karbach and Wolfgang Frings

Page 20: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

4. Implementation

Plug-in Overview

September 19, 2012 Carsten Karbach and Wolfgang Frings 16 40

Page 21: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

4. Implementation

Plug-in Details I

lml.core

uses JAXB to manage LML filesprovides helper classes for comfortable access on LMLdefines events and listeners for all UI actions

lml.ui

implements a ViewPart for Nodes / Jobs and Info Viewsrenders LML components, considers LML layout hintsUI implementation:

Jobs View → JFace TreeViewerNodes View → nests Composites, PaintListenerInfo View → SWT Text / Table

September 19, 2012 Carsten Karbach and Wolfgang Frings 17 40

Page 22: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

4. Implementation

Plug-in Details II

lml.da

set of Perl scripts → simple installationsplit into independent modules:1 Driver calls system specific commands to get status infos2 Combiner merges LML files generated by the Driver3 AddColor assigns unique color to each job4 LML2LML converts intermediate format to final LML

fully configurable workflowonly Driver scripts and workflow description aretarget system specific

September 19, 2012 Carsten Karbach and Wolfgang Frings 18 40

Page 23: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Part V: Scalability

September 19, 2012 Carsten Karbach and Wolfgang Frings

Page 24: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

5. Scalability

Node Display – Compression

September 19, 2012 Carsten Karbach and Wolfgang Frings 20 40

Page 25: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

5. Scalability

Node Display – CompressionAttribute Inheritance

September 19, 2012 Carsten Karbach and Wolfgang Frings 21 40

Page 26: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

5. Scalability

Node Display – Compression

Schemedefines architecture of empty system

September 19, 2012 Carsten Karbach and Wolfgang Frings 22 40

Page 27: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

5. Scalability

Node Display – CompressionRanges

Result→ 3 objects instead of 16

September 19, 2012 Carsten Karbach and Wolfgang Frings 23 40

Page 28: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

5. Scalability

Scalable Visualization – Row Level

September 19, 2012 Carsten Karbach and Wolfgang Frings 24 40

Page 29: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

5. Scalability

Scalable Visualization – Nodeboard Level

September 19, 2012 Carsten Karbach and Wolfgang Frings 25 40

Page 30: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

5. Scalability

Scalable Visualization – CPU Node Level

September 19, 2012 Carsten Karbach and Wolfgang Frings 26 40

Page 31: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

5. Scalability

Table Filtering

scalability of job data displayfiltering by specification ofattribute values / rangesserver- and client-side filtering

September 19, 2012 Carsten Karbach and Wolfgang Frings 27 40

Page 32: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Part VI: Add new target system

September 19, 2012 Carsten Karbach and Wolfgang Frings

Page 33: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

6. Add new target system

LML da – Data Extraction

Which status data is required?

jobs: owner, queue, dispatch date, state ...nodes: memory, cores, state, GPUs ...system: hostname, date, high messages ...

How is status data extracted?

one Perl script per data typecall batch system commands(e.g. qstat and pbsnodes for TORQUE)convert output into intermediate LML files

September 19, 2012 Carsten Karbach and Wolfgang Frings 29 40

Page 34: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

6. Add new target system

LML da – add new target system

1 write scripts for extracting status information e.g.da jobs info LML.pl gathers jobsda nodes info LML.pl gathers nodesda system info LML.pl gets global system information

2 write functions for checking remote commands and forcomposing the workflow → da check info LML.pl

3 put all scripts into a folder in lml.da/rms,folder named after target system

4 add lml.monitor.ui.monitors extensionfor the new system type

September 19, 2012 Carsten Karbach and Wolfgang Frings 30 40

Page 35: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

6. Add new target system

Client – add new target system

LML da generates system independent LML filesclient visualizes new target systems without any changesbut a system specific LML layout can be configuredcurrently this layout configuration is tricky

September 19, 2012 Carsten Karbach and Wolfgang Frings 31 40

Page 36: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

6. Add new target system

LML Layout Configuration I

the following procedure is only meant for testingclient side configuration of LML layout is plannedlong-term target:configure entire layout via GUI, not via XML filethe layout could be saved together withthe target system configuration

September 19, 2012 Carsten Karbach and Wolfgang Frings 32 40

Page 37: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

6. Add new target system

LML Layout Configuration II

1 write your own LML layout file, or adjust an existing(examples in lml.da plug-in)

2 start the monitoring connection to the target system3 log-in on the remote machine and switch

to the .eclipsesettings folder in your home directory4 create the file .LML da options with content

nocheckrequest=15 replace samples/layout default.xml in .eclipsesettings

with your own layout6 refresh the monitoring perspective on your client

September 19, 2012 Carsten Karbach and Wolfgang Frings 33 40

Page 38: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

6. Add new target system

Layout Example – Default

September 19, 2012 Carsten Karbach and Wolfgang Frings 34 40

Page 39: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

6. Add new target system

Layout Example – Custom I

September 19, 2012 Carsten Karbach and Wolfgang Frings 35 40

Page 40: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

6. Add new target system

Layout Example – Custom II

September 19, 2012 Carsten Karbach and Wolfgang Frings 36 40

Page 41: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Part VII: Conclusion

September 19, 2012 Carsten Karbach and Wolfgang Frings

Page 42: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

7. Conclusion

Conclusion

5 interacting views form the monitoring perspectiveLML as data interface between LML da and clientLML is divided into global data,graphical components and layoutcomposition of 6 Eclipse plug-ins(lml.core/ui, monitor.core/ui, lml.da/lml.da.server)new target system:write data extraction scripts + configure layoutScalability

scalable data acquisitionscalable data formatscalable presentation

September 19, 2012 Carsten Karbach and Wolfgang Frings 38 40

Page 43: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

7. Conclusion

Questions?

September 19, 2012 Carsten Karbach and Wolfgang Frings 39 40

Page 44: Monitoring system basics, and adding support for a new ...wiki.eclipse.org/images/d/d0/...Karbach_Frings.pdf · September 19, 2012 Carsten Karbach and Wolfgang Frings. Content 1 PTP

7. Conclusion

Contact

E-mail:[email protected], [email protected] → http://llview.zam.kfa-juelich.de/LML

LLview → http://www.fz-juelich.de/jsc/llview

September 19, 2012 Carsten Karbach and Wolfgang Frings 40 40