Progress on HPCs (or building an "HPC factory" at ANL)
Doug Benjamin, Duke University
Introduction
• People responsible for this effort:
  o Tom LeCompte (ANL – HEP division)
  o Tom Uram (ANL – ALCF – MCS division)
  o Doug Benjamin (Duke University)
• Work supported by US ATLAS
• Development activities performed at the Argonne Leadership Computing Facility (ALCF) at Argonne National Lab
  o Have a director's computing allocation for this work
• US ATLAS members have allocations at other DOE Leadership Class Facilities (HPC sites)
  o NERSC, Oak Ridge, BNL
Goals for this effort
• Develop a simple and robust system
• Scalable
  o Run on many different HPC sites
  o How else can we achieve "world domination"?
  o Seriously – running at many different sites maximizes the benefit to ATLAS.
• Transferable
  o Goal is to make this system deployable/usable by many different people, including HEP faculty – not just the enterprising students and computing professionals
• Work with existing HEP workload management systems (i.e. PanDA)
• Reuse the existing code base wherever possible
  o For example – use the existing PanDA pilot with small tweaks for HPC
• Transparent
  o If you know how to use PanDA, then you know how to use HPC via BALSAM
HPC Boundary conditions
• There are many scientific HPC machines across the US and the world.
  o Need to design a system that is general enough to work on many different machines
• Each machine is independent of the others
  o The "grid" side of the equation must aggregate the information
• There are several different machine architectures
  o ATLAS jobs will not run unchanged on many of the machines
  o Need to compile programs for each HPC machine
  o Memory per node (each with multiple cores) varies from machine to machine
• The computational nodes typically do not have connectivity to the Internet
  o Connectivity is through a login node/edge machine
  o Pilot jobs typically cannot run directly on the computational nodes
  o TCP/IP stack missing on many computational nodes
Additional HPC issues
• Each HPC machine has its own job management system
• Each HPC machine has its own identity management system
• Login/Interactive nodes have mechanisms for fetching information and data files
• HPC computational nodes typically use MPI
• Can get a large number of nodes
• The latency between job submission and completion can be variable (many other users)
• We have to think about how we can adapt to the HPCs, not how the HPCs can adapt to us
  o It will give us more opportunities.
Work Flow
• Some ATLAS simulation jobs can be broken up into 3 components (a rough sketch follows this list)
1. Preparatory phase – make the job ready for HPC
  o For example – generate the computational grid for Alpgen
  o Fetch database files for simulation
  o Transfer input files to the HPC system
2. Computational phase – can be done on the HPC
  o Generate events
  o Simulate events
3. Post-computational phase (cleanup)
  o Collect output files (log files, data files) from HPC jobs
  o Verify output
  o Unweight (if needed) and merge files
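A minimal sketch of how one job could walk through these three phases; every helper name below is hypothetical and stands in for the real ATLAS/BALSAM machinery rather than reproducing it.

    # Illustrative sketch only - helper names are hypothetical placeholders.

    def prepare(job_name):
        # 1. Preparatory phase (Grid/HTC side): e.g. generate the Alpgen
        #    computational grid, fetch database files, stage inputs to the HPC.
        print("preparing inputs for", job_name)

    def compute(job_name):
        # 2. Computational phase (HPC side): generate and simulate events.
        print("running", job_name, "on the HPC")

    def post_process(job_name):
        # 3. Post-computational phase (Grid/HTC side): collect log and data
        #    files, verify the output, unweight if needed, and merge.
        print("merging outputs for", job_name)

    if __name__ == "__main__":
        for phase in (prepare, compute, post_process):
            phase("mc_simulation_example")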
Current Prototype – ("Grid side") Infrastructure
• APF pilot factory to submit pilots
• PanDA queue – currently testing an ANALY queue
• Local batch system
• Web server provides steering XML files to the HPC domain
• Message broker system to exchange information between the Grid domain and the HPC domain
• GridFTP server to transfer files between the HTC domain and the HPC domain
  o Globus Online might be a good solution here (what are the costs?)
• ATLAS DDM site – SRM and GridFTP server(s)
This is a working system
HPC code stack: "BALSAM"
"Be not distressed, friend," said Don Quixote, "for I will now make the precious balsam with which we shall cure ourselves in the twinkling of an eye."
• All credit goes to Tom Uram – ANL
• Work on the HPC side is performed by two components
o Service: Interacts with message broker to retrieve job descriptions, saves jobs in local database, notifies message broker of job state changes
o Daemon: Stages input data from HTC GridFTP server, submits job to queue, monitors progress of job, and stages output data to HTC GridFTP server
• Service and Daemon are built in Python, using the Django Object Relational Mapper (ORM) to communicate with the shared underlying database (a minimal sketch of such a job record follows this slide)
  o Django is a stable, open-source project with an active community
  o Django supports several database backends
• Current implementation relies on GridFTP for data transfer and the ALCF Cobalt scheduler
• Modular design enables future extension to alternative data transfer mechanisms and schedulers
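As a rough illustration of the shared database that the Service and Daemon talk to through the Django ORM, here is a minimal Django model sketch; the field and state names are assumptions made for illustration, not the actual BALSAM schema.

    # Minimal sketch of a job record the Service and Daemon could share;
    # fields and states are illustrative, not the real BALSAM schema.
    from django.db import models

    class HPCJob(models.Model):
        STATES = [
            ("CREATED", "Created"),        # description received from the message broker
            ("STAGED_IN", "Staged in"),    # inputs pulled from the HTC GridFTP server
            ("QUEUED", "Queued"),          # submitted to the local scheduler (e.g. Cobalt)
            ("RUNNING", "Running"),
            ("STAGED_OUT", "Staged out"),  # outputs pushed back to the HTC GridFTP server
            ("FAILED", "Failed"),
        ]
        panda_id = models.BigIntegerField()            # id carried in the broker message
        executable = models.CharField(max_length=256)
        input_url = models.CharField(max_length=512)   # GridFTP source (HTC side)
        output_url = models.CharField(max_length=512)  # GridFTP destination (HTC side)
        state = models.CharField(max_length=16, choices=STATES, default="CREATED")
        scheduler_id = models.CharField(max_length=64, blank=True)
        updated = models.DateTimeField(auto_now=True)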
Some BALSAM details
• Code runs in user space – has a "daemon" mode run by the user (a rough sketch of such a loop follows this slide)
• Written in python using the virtualenv system to encapsulate the python environment
• Requires Python 2.6 or later (not tested with v3.0 yet)
  o Adding additional batch queues like Condor requires some code factorization by a non-expert (i.e. me)
• Incorporating code and ideas from other projects – e.g. AutoPyFactory by John Hover and Jose Caballero
  o Some of their Condor bits and likely the proxy bits
• Can run outside of HPC – e.g. on my Mac (or Linux machines at work and home)
  o Useful for code development
• Work proceeding on robust error handling and good documentation
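A very small sketch of what the user-space "daemon" loop could look like, assuming a Cobalt-style qsub interface (-n nodes, -t minutes, -A project); the stage-in/out steps and database updates of the real BALSAM daemon are omitted, and all names below are placeholders.

    # Sketch only: a user-space daemon loop that submits staged jobs to a
    # Cobalt-like scheduler.  Flags and project name are assumptions; the real
    # BALSAM daemon also handles GridFTP stage-in/out and database updates.
    import subprocess
    import time

    def submit(script, nodes=512, minutes=60, project="ATLAS_HPC"):
        cmd = ["qsub", "-n", str(nodes), "-t", str(minutes), "-A", project, script]
        return subprocess.check_output(cmd).decode().strip()  # scheduler job id

    def daemon_loop(pending_jobs, poll_seconds=60):
        while True:
            while pending_jobs:
                job_id = submit(pending_jobs.pop())
                print("submitted as scheduler job", job_id)
            # A real daemon would also poll the scheduler, update job states in
            # the shared database, and stage outputs back when jobs finish.
            time.sleep(poll_seconds)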
Moving to “HPC factory”
• Why not have the HPC code startup the Panda pilots?
• Panda pilots handle the communication between Panda system and HPC jobs through communication with the BALSAM system.
• Standard ATLAS pilot code with some tweaks can be used in the HPC system.
  o Modularity of the pilot helps here
• The APF can be used as a guide.
Status of porting pilot to HPC
• Using the VESTA machine at ANL
  o Current-generation HPC (a one-rack equivalent of the recently dedicated MIRA HPC)
• Python 2.6 running on the login node
• Pilot code starts up and begins to run
• … but terminates early due to missing links to the ATLAS code area
• Straightforward to fix (need to predefine environment variables and create the needed files; a small sketch follows this slide)
• Expect other types of issues.
• Need to agree on how the pilot will get information from the BALSAM system and vice versa
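A tiny sketch of the kind of wrapper that could pre-seed the environment and files the pilot expects before it is launched on the login node; the variable names, file name, and pilot entry point below are placeholders, not the agreed interface.

    # Sketch only: pre-seed the environment and files the pilot expects.
    # Names here (VO_ATLAS_SW_DIR, queuedata.json, pilot.py) are placeholders.
    import os
    import subprocess

    os.environ.setdefault("VO_ATLAS_SW_DIR", "/soft/atlas")  # stand-in code area
    os.environ.setdefault("X509_USER_PROXY", os.path.expanduser("~/proxy"))

    # Create any files the pilot assumes exist instead of downloading them.
    if not os.path.exists("queuedata.json"):
        open("queuedata.json", "w").close()

    # Launch the pilot from this prepared environment (arguments omitted;
    # the entry point name is illustrative).
    subprocess.call(["python", "pilot.py"])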
Other ATLAS code changes
• Given that many of the Leadership machines in the US are not based on x86-64 processors, the ATLAS transforms will need to be modified.
• New transform routines ("Gen-TF") will be needed to run on these machines.
  o Effort might exist at ANL to do some of this work
  o New ANL scientist hire
  o New US ATLAS Graduate Fellow has some time available for HPC code work
• Transformation code work is part of a larger effort to be able to run on more heterogeneous architectures.
Where we could use help
• BALSAM code will make a good basis for the HPC factory; it is flexible enough to run on many different HPCs
• Need help to make extensions to the existing pilot code
• Need help understanding what information pilot needs and where it gets it – so we use the same pilot code as everyone else.
• Need help with writing the new transforms.
• We likely have the existing worker bees; we just need consultant help
Open issues for a production system
• Need federated identity management
  o The Grid identity system is not used in the HPC domain
  o Need to strictly regulate who can run on HPC machines
• Security, security (need I say more?)
• What is the proper scale for the front-end Grid cluster?
  o How many nodes are needed?
  o How much data needs to be merged?
Conclusions
• Many ATLAS MC jobs can be divided into a Grid (HTC) component and an HPC component
• Have demonstrated that, using existing ATLAS tools, we can design and build a system to send jobs from the Grid to HPC and back to the Grid
• Modular design of all components makes it easier to add new HPC sites and clone the HTC side if needed for scaling reasons.
• A lightweight yet powerful system is being developed – the beginning of “HPC factory”
Extra slides
Message Broker system
• System must have large community support beyond just HEP
• Solution must be open source (Keeps Costs manageable)
• Message Broker system must have good documentation
• Scalable
• Robust
• Secure
• Easy to use
• Must use a standard protocol (AMQP 0-9-1, for example)
• Clients in multiple languages (like Java/Python)
RabbitMQ message broker
• ActiveMQ and RabbitMQ were evaluated
• Google analytics shows both are equally popular
• Benchmark measurements show that the RabbitMQ server outperforms ActiveMQ
• Found it easier to handle message routing and our workflow with RabbitMQ
• The Pika Python client is easy to use (a small example follows this slide)
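For instance, publishing a short message with Pika takes only a few lines (the broker host, queue name, and message body below are placeholders):

    # Minimal Pika example: publish one message to a durable queue.
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="broker.example.org"))
    channel = connection.channel()
    channel.queue_declare(queue="hpc.jobs", durable=True)
    channel.basic_publish(exchange="", routing_key="hpc.jobs", body="job description")
    connection.close()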
Basic Message Broker design
• Each HPC has multiple permanent, durable queues
  o One queue per activity on the HPC
  o Grid jobs send messages to HPC machines through these queues
  o Each HPC will consume messages from these queues
  o A routing string is used to direct each message to the proper place
• Each Grid job will have multiple durable queues
  o One queue per activity (step in the process)
  o The Grid job creates the queues before sending any message to the HPC queues
  o On completion of the Grid job, its queues are removed
  o Each HPC cluster publishes messages to these queues through an exchange
  o A routing string is used to direct each message to the proper place
  o Grid jobs will consume messages only on their own queues
• Grid domains and HPC domains have independent polling loops
• Message producer and client code needs to be tweaked for additional robustness (a sketch of the queue-per-activity layout follows)
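A minimal sketch, again using Pika, of the queue layout described above: one durable queue per activity on an HPC, bound to a direct exchange so a routing string steers each message to the proper place. The exchange, queue, and routing-key names are made up for illustration.

    # Sketch of the per-activity queue topology (names are illustrative only).
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="broker.example.org"))
    channel = connection.channel()

    channel.exchange_declare(exchange="hpc_vesta", exchange_type="direct", durable=True)
    for activity in ("submit", "status", "delete"):
        queue = "vesta.%s" % activity
        channel.queue_declare(queue=queue, durable=True)
        channel.queue_bind(queue=queue, exchange="hpc_vesta", routing_key=activity)

    # A Grid job then publishes with the routing string for the target activity.
    channel.basic_publish(exchange="hpc_vesta", routing_key="submit",
                          body="new job for VESTA")
    connection.close()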