Progress on HPC’s (or building “HPC factory” at ANL)
Doug Benjamin, Duke University

TRANSCRIPT

Page 1: Progress on HPC’s (or building “HPC factory” at ANL)

Doug Benjamin, Duke University

Page 2:

Introduction

• People responsible for this effort:
o Tom LeCompte (ANL – HEP division)
o Tom Uram (ANL – ALCF – MCS division)
o Doug Benjamin (Duke University)

• Work supported by US ATLAS

• Development activities performed at the Argonne Leadership Computing Facility (ALCF)
o Have a director’s computing allocation for this work

• US ATLAS members have allocations at other DOE Leadership Class Facilities (HPC sites)
o NERSC, Oak Ridge, BNL

Page 3:

Goals for this effort

• Develop a simple and robust system

• Scalable
o Run on many different HPC sites
o How else can we achieve “world domination”?
o Seriously – running at many different sites maximizes the benefit to ATLAS.

• Transferable
o Goal is to make this system deployable/usable by many different people, including HEP faculty, not just the enterprising students and computing professionals

• Work with existing HEP workload management systems (i.e. PanDA)

• Reuse the existing code base wherever possible
o For example, use the existing PanDA pilot with small tweaks for HPC

• Transparent
o If you know how to use PanDA, then you know how to use HPC via BALSAM

Page 4:

HPC Boundary conditions

• There are many scientific HPC machines across the US and the world.
o Need to design a system that is general enough to work on many different machines

• Each machine is independent of the others
o The “grid” side of the equation must aggregate the information

• There are several different machine architectures
o ATLAS jobs will not run unchanged on many of the machines
o Need to compile programs for each HPC machine
o Memory per node (each with multiple cores) varies from machine to machine

• The computational nodes typically do not have connectivity to the Internet
o Connectivity is through a login node/edge machine
o Pilot jobs typically cannot run directly on the computational nodes
o TCP/IP stack is missing on many computational nodes

Page 5:

Additional HPC issues

• Each HPC machine has its own job management system

• Each HPC machine has its own identity management system

• Login/Interactive nodes have mechanisms for fetching information and data files

• HPC computational nodes typically run MPI

• Can get a large number of nodes

• The latency between job submission and completion can be variable (many other users)

• We have to think about how we can adapt to the HPC’s, not how the HPC’s can adapt to us
o It will give us more opportunities.

Page 6:

Work Flow

• Some ATLAS simulation jobs can be broken up into three components (a rough sketch of this flow follows the list below)

1. Preparatory phase – make the job ready for HPC
o For example, generate the computational grid for Alpgen
o Fetch database files for simulation
o Transfer input files to the HPC system

2. Computational phase – can be done on the HPC
o Generate events
o Simulate events

3. Post-computational phase (cleanup)
o Collect output files (log files, data files) from HPC jobs
o Verify output
o Unweight (if needed) and merge files
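A minimal Python sketch of how these three phases might be chained. All helper names, URLs, and commands are illustrative placeholders, not the actual ATLAS or BALSAM code:

```python
# Minimal sketch of the three-phase flow above; names and commands are
# illustrative placeholders, not the actual ATLAS or BALSAM implementation.
import subprocess

def prepare_job(job):
    """Preparatory phase (Grid/HTC side): make the job ready for the HPC."""
    # e.g. generate the Alpgen computational grid, fetch database files, then
    # transfer the input files to the HPC with GridFTP
    subprocess.check_call(["globus-url-copy", job["input_url"], job["hpc_url"]])

def run_on_hpc(job):
    """Computational phase (HPC side): generate and simulate events."""
    # submitted through the site scheduler (Cobalt at ALCF); placeholder command
    subprocess.check_call(["qsub", job["submit_script"]])

def post_process(job):
    """Post-computational phase (Grid/HTC side): collect, verify, merge output."""
    subprocess.check_call(["globus-url-copy", job["hpc_output_url"], job["local_url"]])
    # verify output, unweight if needed, and merge files here

jobs = []  # in practice, job descriptions would arrive from PanDA via the pilot
for job in jobs:
    prepare_job(job)
    run_on_hpc(job)
    post_process(job)
```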

Page 7:

Current Prototype - (“Grid side”) Infrastructure

• APF pilot factory to submit pilots
• PanDA queue – currently testing an ANALY queue
• Local batch system
• Web server provides steering XML files to the HPC domain
• Message broker system to exchange information between the Grid domain and the HPC domain
• GridFTP server to transfer files between the HTC domain and the HPC domain
o Globus Online might be a good solution here (what are the costs?)
• ATLAS DDM site – SRM and GridFTP server(s)

This is a working system

Page 8:

HPC code stack: “BALSAM”

"Be not distressed, friend," said Don Quixote, "for I will now make the precious balsam with which we shall cure ourselves in the twinkling of an eye."• All credit goes to Tom Uram - ANL• Work on HPC side is performed by two components

o Service: Interacts with message broker to retrieve job descriptions, saves jobs in local database, notifies message broker of job state changes

o Daemon: Stages input data from HTC GridFTP server, submits job to queue, monitors progress of job, and stages output data to HTC GridFTP server

• Service and Daemon are built in Python, using the Django Object Relational Mapper (ORM) to communicate with the shared underlying database (a minimal model sketch follows below)
o Django is a stable, open-source project with an active community
o Django supports several database backends

• Current implementation relies on GridFTP for data transfer and the ALCF Cobalt scheduler

• Modular design enables future extension to alternative data transfer mechanisms and schedulers
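As a rough illustration of the Django ORM approach, a hypothetical job record might look like the sketch below. The class and field names are illustrative only and are not the actual BALSAM schema:

```python
# Hypothetical sketch of a Django ORM model for a BALSAM-style job record.
# Field and class names are illustrative, not the actual BALSAM schema.
from django.db import models

class HPCJob(models.Model):
    """One job as tracked by both the Service and the Daemon."""
    panda_id = models.BigIntegerField(unique=True)          # PanDA job identifier
    state = models.CharField(max_length=32, default="CREATED")  # e.g. CREATED, STAGED_IN, QUEUED, RUNNING, DONE, FAILED
    input_url = models.TextField()                           # GridFTP source for stage-in
    output_url = models.TextField()                          # GridFTP destination for stage-out
    scheduler_id = models.CharField(max_length=64, blank=True)  # e.g. Cobalt job id
    last_update = models.DateTimeField(auto_now=True)

# The Service would write rows as job descriptions arrive from the message broker;
# the Daemon would poll for rows in the appropriate states, stage data, submit to
# the scheduler, and update 'state' so the Service can report back to the broker.
```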

Page 9:

Some BALSAM details

• Code runs in user space – has a “daemon” mode run by user

• Written in Python, using the virtualenv system to encapsulate the Python environment

• Requires Python 2.6 or later (not tested with v3.0 yet)
o Adding additional batch queues like Condor requires some code factorization by a non-expert (i.e. me)

• Incorporating code and ideas from other projects – e.g. AutoPyFactory by John Hover and Jose Caballero
o Some of their Condor bits and likely the proxy bits

• Can run outside of HPC – like on my Mac (or Linux machines at work and home)
o Useful for code development

• Work proceeding on robust error handling and good documentation

Page 10:

Moving to “HPC factory”

• Why not have the HPC code start up the PanDA pilots?

• PanDA pilots handle the communication between the PanDA system and HPC jobs through communication with the BALSAM system.

• Standard ATLAS pilot code with some tweaks can be used in the HPC system.
o Modularity of the pilot helps here

• The APF can be used as a guide.

Page 11:

Status of porting pilot to HPC

• Using the VESTA machine at ANL
o Current-generation HPC (one-rack equivalent of the recently dedicated MIRA HPC)

• Python 2.6 running on the login node

• Pilot code starts up and begins to run

• … but … terminates early due to missing links to the ATLAS code area

• Straightforward to fix (need to predefine environment variables and create the needed files; a hypothetical sketch follows below)

• Expect other types of issues.

• Need to agree on how the pilot will get information from the BALSAM system and vice versa
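A hypothetical sketch of what predefining the environment and creating the needed files could look like on the login node. The variable names, paths, and pilot invocation are placeholders, not the actual ATLAS setup:

```python
# Hypothetical sketch of pre-seeding the pilot environment on the HPC login node.
# Variable names, paths, and the pilot invocation are placeholders only.
import os
import subprocess

env = dict(os.environ)
env["ATLAS_SW_AREA"] = "/soft/atlas/releases"   # hypothetical ATLAS software area
env["VO_ATLAS_SW_DIR"] = env["ATLAS_SW_AREA"]

work_dir = "/home/user/pilot_work"
if not os.path.isdir(work_dir):                  # create the working area the pilot expects
    os.makedirs(work_dir)

# placeholder invocation; the real pilot command-line options differ
subprocess.check_call(["python", "pilot.py"], cwd=work_dir, env=env)
```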

Page 12:

Other ATLAS code changes

• Given that many of the Leadership machines in the US are not based on x86-64 processors, we will need to modify the ATLAS transforms.

• New transform routines (“Gen-TF”) will be needed to run on these machines.
o Effort might exist at ANL to do some of this work
o New ANL scientist hire
o New US ATLAS Graduate Fellow has some time available for HPC code work

• Transformation code work is part of a larger effort to be able to run on more heterogeneous architectures.

Page 13:

Where we could use help

• BALSAM code will make a good basis for the HPC factory; it is flexible enough to run on many different HPCs

• Need help to make extensions to the existing pilot code

• Need help understanding what information the pilot needs and where it gets it – so we use the same pilot code as everyone else.

• Need help with writing the new transforms.

• We likely have the existing worker bees; we just need consultant help

Page 14:

Open issues for a production system

• Need federated identity management
o The Grid identity system is not used in the HPC domain
o Need to strictly regulate who can run on HPC machines

• Security, security (need I say more?)

• What is the proper scale for the front-end grid cluster?
o How many nodes are needed?
o How much data needs to be merged?

Page 15:

Conclusions

• Many ATLAS MC jobs can be divided into a Grid (HTC) component and an HPC component

• Have demonstrated, using existing ATLAS tools, that we can design and build a system to send jobs from the Grid to HPC and back to the Grid

• Modular design of all components makes it easier to add new HPC sites and clone the HTC side if needed for scaling reasons.

• A lightweight yet powerful system is being developed – the beginning of “HPC factory”

Page 16:

Extra slides

Page 17:

Message Broker system

• System must have large community support beyond just HEP

• Solution must be open source (keeps costs manageable)

• Message Broker system must have good documentation

• Scalable
• Robust
• Secure
• Easy to use
• Must use a standard protocol (AMQP 0-9-1, for example)
• Clients in multiple languages (like Java/Python)

Page 18:

RabbitMQ message broker

• ActiveMQ and RabbitMQ were evaluated.
• Google Analytics shows both are equally popular
• Benchmark measurements show that the RabbitMQ server outperforms ActiveMQ
• Found it easier to handle message routing and our workflow
• The Pika Python client is easy to use (a minimal sketch follows below).
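As an illustration of the Pika client, a minimal publish of a job message to a durable queue might look like the sketch below. The broker host, queue name, and message body are hypothetical:

```python
# Minimal sketch of publishing a message with the Pika client, assuming a local
# RabbitMQ broker; queue name and message body are hypothetical, not BALSAM's.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# declare a durable queue so messages survive a broker restart
channel.queue_declare(queue="hpc.vesta.jobs", durable=True)

# publish a (hypothetical) job description to the queue
channel.basic_publish(
    exchange="",                       # default exchange: routing key == queue name
    routing_key="hpc.vesta.jobs",
    body='{"panda_id": 12345, "transform": "Gen-TF"}',
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)

connection.close()
```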

Page 19:

Basic Message Broker design

• Each HPC has multiple permanent durable queues (see the topology sketch after this list).
o One queue per activity on the HPC
o Grid jobs send messages to HPC machines through these queues
o Each HPC will consume messages from these queues
o A routing string is used to direct each message to the proper place

• Each Grid job will have multiple durable queues
o One queue per activity (step in the process)
o The Grid job creates the queues before sending any message to the HPC queues
o On completion of the Grid job, its queues are removed
o Each HPC cluster publishes messages to these queues through an exchange
o A routing string is used to direct each message to the proper place
o Grid jobs will consume messages only from their own queues.

• Grid domains and HPC domains have independent polling loops

• Message producer and client code needs to be tweaked for additional robustness
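A sketch of this queue/exchange topology with Pika, using illustrative exchange, queue, activity, and routing-key names (not the actual BALSAM ones):

```python
# Hypothetical sketch of the queue/exchange topology described above, using Pika.
# Exchange, queue, activity, and routing-key names are illustrative only.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# one direct exchange through which the HPC side publishes back to Grid jobs
channel.exchange_declare(exchange="hpc_to_grid", exchange_type="direct", durable=True)

# permanent durable queues on the HPC side: one per activity
for activity in ("stage_in", "submit", "status", "stage_out"):
    channel.queue_declare(queue="hpc.vesta.%s" % activity, durable=True)

# per-Grid-job durable queues, created before any message is sent to the HPC,
# bound to the exchange with a routing string identifying the job and activity
grid_job_id = "gridjob-0001"   # hypothetical identifier
for activity in ("stage_in", "submit", "status", "stage_out"):
    qname = "%s.%s" % (grid_job_id, activity)
    channel.queue_declare(queue=qname, durable=True)
    channel.queue_bind(queue=qname, exchange="hpc_to_grid", routing_key=qname)

connection.close()
```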