Batch Software at JLAB
Ian Bird, Jefferson Lab
CHEP 2000, 7-11 February 2000
Introduction
• Environment
  – Farms
  – Data flows
  – Software
• Batch systems
  – JLAB software
  – LSF vs. PBS
• Scheduler
• Tape software
  – File pre-staging/caching
Environment
• Computing facilities were designed to:
  – Handle data rate of close to 1 TB/day
  – 1st level reconstruction only (2 passes)
    • Match average data rate
  – Some local analysis but mainly export of vastly reduced summary DSTs
• Originally estimated requirements:
  – ~ 1000 SI95
  – 3 TB online disk
  – 300 TB tape storage
  – 8 Redwood drives
Environment - real
• After 1 year of production running of CLAS (largest experiment)
  – Detector is far cleaner than anticipated, which means:
    • Data volume is less: ~ 500 GB/day
    • Data rate is 2.5x anticipated (2.5 kHz)
    • Fraction of good events is larger
    • DST sizes are the same as raw data (!)
  – Per-event processing time is much longer than original estimates
  – Most analysis is done locally – no one is really interested in huge data exports
• Other experiments also have large data rates (for short periods)
Computing implications
• CPU requirement is far greater
  – Current farm is 2650 SI95 and will double this year
• Farm has a big mixture of work
  – Not all production – “small” analysis jobs too
  – We make heavy use of LSF hierarchical scheduling
• Data access demands are enormous
  – DSTs are huge, many people, frequent accesses
  – Analysis jobs want many files
• Tape access became a bottleneck
  – Tape system can no longer keep the farm supplied with data
JLab Farm Layout
[Figure: current farm layout. STK Redwood tape drives (SCSI2 FWD/UWD) attached to a quad-CPU Sun E4000 and a quad-CPU Sun E3000 with 400 GB, 150 GB, and 200 GB stage disks; two MetaStor SH7400 file servers, each with a 3 TB UWD work area; a Cisco Cat 5500 switch; farm nodes (20 dual PII 450 MHz and 20 dual PII 400 MHz, each with an 18 GB UWS disk) connected over Fast and Gigabit Ethernet.]
Plan - FY 2000
[Figure: planned FY 2000 layout. STK 9840 tape drives and mass storage servers; work file servers; four cache file servers (dual-CPU Sun Ultra2, 400 GB UWD each); Cisco 2900 switches on Gigabit Ethernet; farm systems of 25 dual PIII 500 MHz and 25 dual PIII 650 MHz nodes (18 GB UWS each), plus 10 dual PII 300 MHz nodes (18 GB FWD each).]
Other farms
• Batch farm – 180 nodes -> 250
• Lattice QCD
  – 20-node Alpha (Linux) cluster
  – Parallel application development
  – Plans (proposal) for a large 256-node cluster
    • Part of a larger collaboration
    • Group wants a “meta-facility”
      – Jobs run on the least-loaded cluster (wide-area scheduling)
Additional requirements
• Ability to handle and schedule parallel jobs (MPI)
• Allow collaborators to “clone” the batch systems and software
  – Allow inter-site job submission
  – LQCD is particularly interested in this
• Remote data access
Components
• Batch software
  – Interface to underlying batch system
• Tape software
  – Interface to OSM, overcome its limitations
• Data caching strategies
  – Tape staging
  – Data caching
  – File servers
Batch software
• A layer over the batch management system
  – Allows replacement of the batch system (LSF, PBS, DQS)
  – Constant user interface no matter what the underlying system is
  – Batch farm can be managed by the management system (e.g. LSF)
  – Builds in a security infrastructure (e.g. GSI)
    • Particularly to allow remote access securely
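The talk gives only the block diagram below; as a minimal sketch of the idea, with invented names (BatchBackend, JobSpec, and the placeholder job ids are not the actual JLab software), the layer can be a small Java interface with one implementation per underlying batch system:

import java.io.IOException;

// Sketch only: invented names illustrating the interface layer described
// above. Each backend maps the common JobSpec onto its own submit command.
interface BatchBackend {
    String submit(JobSpec spec) throws IOException, InterruptedException;
}

record JobSpec(String queue, String script) {}

class LsfBackend implements BatchBackend {
    @Override
    public String submit(JobSpec spec) throws IOException, InterruptedException {
        // LSF submission: bsub -q <queue> <script>
        Process p = new ProcessBuilder("bsub", "-q", spec.queue(), spec.script())
                .inheritIO().start();
        p.waitFor();
        return "lsf-job-id";   // real code would parse the job id from bsub output
    }
}

class PbsBackend implements BatchBackend {
    @Override
    public String submit(JobSpec spec) throws IOException, InterruptedException {
        // PBS submission: qsub -q <queue> <script>
        Process p = new ProcessBuilder("qsub", "-q", spec.queue(), spec.script())
                .inheritIO().start();
        p.waitFor();
        return "pbs-job-id";
    }
}

User jobs are written against BatchBackend only, so switching the farm from LSF to PBS requires no change on the user side.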
Batch system - schematic
[Figure: user processes (submission, query, statistics) talk to the submission and query interfaces, which share a database; the job submission system drives the batch control system (LSF, PBS, DQS, etc.), which in turn feeds the batch processors.]
Existing batch software
• Has been running for 2 years
  – Uses LSF
  – Multiple jobs – parameterized jobs (LSF now has job arrays; PBS does not)
  – Client is trivial to install on any machine with a JRE – no need to install LSF, PBS, etc.
    • Eases licensing issues
    • Simple software distribution
    • Remote access
  – Standardized statistics and bookkeeping outside of LSF
    • MySQL based
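As a toy illustration of the parameterized-jobs feature (invented names; not the actual JLab client), one job template can be expanded into one batch job per input file, which is also how job arrays can be emulated on systems that lack them:

import java.util.ArrayList;
import java.util.List;

class ParamJobExpander {
    /** Replace the %FILE% placeholder with each input file in turn. */
    static List<String> expand(String template, List<String> files) {
        List<String> jobs = new ArrayList<>();
        for (String f : files) {
            jobs.add(template.replace("%FILE%", f));
        }
        return jobs;
    }

    public static void main(String[] args) {
        expand("recon -in %FILE% -out %FILE%.dst",
               List.of("run001.data", "run002.data"))
            .forEach(System.out::println);
        // Prints:
        //   recon -in run001.data -out run001.data.dst
        //   recon -in run002.data -out run002.data.dst
    }
}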
Existing software cont.
• Farm can be managed by LSF
  – Queues, hosts, scheduler, etc.
• Rewrite in progress to:
  – Add PBS interface (and DQS?)
  – Add a security infrastructure to permit authenticated remote access
  – Clean up
PBS as alternative to LSF
• PBS (Portable Batch System – NASA)
  – Actively developed
  – Open, freely available
  – Handles MPI (PVM)
  – User interface very familiar to NQS/DQS users
  – Problem (for us) was the lack of a good scheduler
    • PBS provides only a trivial scheduler, but
    • Provides a mechanism to plug in another
    • We were using hierarchical scheduling in LSF
PBS scheduler
• Multiple stages (6), each of which can be used or not as required, in arbitrary order
  – Match making – matches requirements to system resources
  – System priority (e.g. data available)
  – Queue selection (which queue runs next)
  – User priority
  – User share: which user runs next, based on user and group allocations and usage
  – Job age
• Scheduler has been provided to the PBS developers for comment – and is under test
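The slides do not show the stage logic itself; a minimal sketch of the staged idea, with invented names and only a subset of the stages (queue selection and user priority are elided), might look like this in Java:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Invented type: one candidate job with the attributes the stages inspect.
record Job(String id, boolean dataStaged, double userShareUsed, long ageSeconds) {}

class StagedScheduler {
    /** Pick the next job to run; each stage contributes one criterion. */
    static Optional<Job> selectNext(List<Job> candidates) {
        Comparator<Job> order = Comparator
            // System priority: jobs whose data is already staged run first.
            .comparing((Job j) -> !j.dataStaged())
            // User share: users furthest below their allocation run first.
            .thenComparingDouble(Job::userShareUsed)
            // Job age: the oldest job wins any remaining tie.
            .thenComparing(Comparator.comparingLong(Job::ageSeconds).reversed());
        return candidates.stream()
            // Match making: drop jobs whose resource requirements cannot
            // be met right now (the real test is elided here).
            .filter(j -> true)
            .min(order);
    }
}

The remaining stages would slot in as further criteria; making the stage order configurable then amounts to reordering the comparator chain.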
Mass storage
• Silo – 300 TB Redwood capacity
  – 8 Redwood drives
  – 5 (+5) 9840 drives
  – Managed by OSM
• Bottleneck:
  – Limited to a single data mover
  – That node has no capacity for more drives
• 1 TB tape staging RAID disk
• 5 TB of NFS work areas/caching space
Solving tape access problems
• Add new drives – 9840s
  – Requires a 2nd OSM instance
    • Transparent to user
• Eventual replacement of OSM
  – Transparent to user
• File pre-staging to the farm
• Distributed data caching (not NFS)
• Tools to allow user optimization
• Charge for (prioritize) mounts
OSM
• OSM has several limitations (and is no longer supported)
  – Single mover node is the most serious
    • No replacement possible yet
• Local tapeserver software solves many of these problems for us
  – Simple remote clients (Java based) – do not need OSM except on the server
Tape access software
• Simple put/get interface
  – Handles multiple files, directories, etc.
• Can have several OSM instances with a unique file catalog, transparent to the user
  – System fails over between servers
• Only way to bring the 9840s online
• Data transfer is a network (socket) copy in Java
• Allows a scheduling/user-allocation algorithm to be added to tape access
• Will permit “transparent” replacement of OSM
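A hypothetical sketch of such a put/get client (names and the one-line wire protocol are invented; the real JLab tapeserver protocol is not described in the talk): the client tries each server in its list, failing over to the next on error, with the transfer itself a plain socket copy:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class TapeClient {
    private final List<String> servers;   // tape server hosts, in failover order
    private final int port;

    TapeClient(List<String> servers, int port) {
        this.servers = servers;
        this.port = port;
    }

    /** Fetch one file from the tape system, trying each server in turn. */
    void get(String catalogPath, Path localPath) throws IOException {
        IOException last = null;
        for (String host : servers) {
            try (Socket s = new Socket(host, port)) {
                OutputStream out = s.getOutputStream();
                out.write(("GET " + catalogPath + "\n")
                        .getBytes(StandardCharsets.US_ASCII));
                out.flush();
                try (InputStream in = s.getInputStream()) {
                    Files.copy(in, localPath);   // plain network (socket) copy
                }
                return;                          // success: stop failing over
            } catch (IOException e) {
                last = e;                        // this server failed; try the next
            }
        }
        throw last != null ? last : new IOException("no tape servers configured");
    }
}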
Data pre-fetching & caching
• Currently:
  – Tape → stage disk → network copy to farm node local disk
  – Tape → stage disk → NFS cache → farm
    • But this can cause NFS server problems
• Plan:
  – Dual Solaris nodes with:
    • ~ 350 GB disk (RAID 0)
    • Gigabit Ethernet
    • Provides a large cache for farm input
  – Stage out entire tapes to cache
    • Cheaper than staging space, better performance than NFS
    • Scalable as the farm grows
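A rough sketch of the whole-tape caching idea (all names invented): on a cache miss, the cache stages the entire tape rather than the single file, on the assumption that jobs grouped by tape will soon want the neighbouring files:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

class TapeCache {
    private final Path cacheRoot;                 // cache area on a cache file server
    private final Set<String> tapesBeingStaged = new HashSet<>();

    TapeCache(Path cacheRoot) { this.cacheRoot = cacheRoot; }

    /** Return the cached copy if present; otherwise trigger a whole-tape stage. */
    synchronized Path lookup(String tapeLabel, String fileName) {
        Path cached = cacheRoot.resolve(tapeLabel).resolve(fileName);
        if (Files.exists(cached)) return cached;  // cache hit: serve from disk
        if (tapesBeingStaged.add(tapeLabel)) {
            stageWholeTape(tapeLabel);            // one mount serves many requests
        }
        return null;                              // caller holds the job until staged
    }

    private void stageWholeTape(String tapeLabel) {
        // Elided: ask the tape server to copy every file on tapeLabel into
        // cacheRoot/tapeLabel in one sequential pass over the tape.
    }
}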
[Figure: JLab Farm Layout and Plan - FY 2000 diagrams, repeated from the earlier slides]
File pre-staging
• Scheduling for pre-staging is done by the job server software
  – Splits/groups jobs by tape (could be done by user)
  – Makes a single tape request
  – Holds jobs while files are staged
  – Implemented by batch jobs that release held jobs
  – Released jobs with data available get high priority
  – Reduces job slots blocked by jobs waiting for data
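As a toy sketch of the grouping step (invented names; the actual job server logic is not shown in the talk), held jobs are bucketed by the tape holding their input, so each tape is requested exactly once:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

record FarmJob(String id, String inputFile, String tapeLabel) {}

class PreStager {
    /** Group held jobs by tape so one stage request covers all of them. */
    static Map<String, List<FarmJob>> groupByTape(List<FarmJob> held) {
        Map<String, List<FarmJob>> byTape = new HashMap<>();
        for (FarmJob j : held) {
            byTape.computeIfAbsent(j.tapeLabel(), t -> new ArrayList<>()).add(j);
        }
        return byTape;
    }

    static void requestStaging(List<FarmJob> held) {
        groupByTape(held).forEach((tape, jobs) ->
            // One tape request per group; a follow-up batch job releases the
            // held jobs (at high priority) once the files are on stage disk.
            System.out.printf("stage tape %s for %d held jobs%n", tape, jobs.size()));
    }
}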
Conclusions
• PBS is a sophisticated and viable alternative to LSF
• Interface layer permits:
  – Use of the same jobs on different systems – user migration
  – Adding features to the batch system