Ian C. Smith The University of Liverpool Condor Pool

Post on 01-Jan-2016


Ian C. Smith

The University of Liverpool Condor Pool

University of Liverpool Condor Pool

contains around 300 machines running the University’s Managed Windows (XP) Service.

most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine.

software updates via a weekly re-imaging process.

single combined submit host / central manager running on Sun Solaris V440 SMP server.

restricted access to submit host for registered Condor users.

currently running Condor 7.0.2 (moving to 7.4.2 soon hopefully).

policy is to run jobs during office hours only after at least 5 minutes of inactivity and with a low load average, and at any time outside of office hours.
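The run policy can be read as a simple predicate. The sketch below is illustrative only: the office-hours window (09:00 to 17:00, Monday to Friday) and the load threshold of 0.3 are assumptions, not values given in the slides, and on the real pool this logic would be expressed in the Condor configuration rather than in Python.

```python
from datetime import datetime, time

# Assumed values for illustration; the slides do not state them.
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 0)
MAX_LOAD = 0.3             # hypothetical "low load average" threshold
MIN_IDLE_SECONDS = 5 * 60  # "at least 5 minutes of inactivity" (from the slide)


def in_office_hours(now: datetime) -> bool:
    """True on weekdays between OFFICE_START and OFFICE_END."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def may_start_job(now: datetime, idle_seconds: float, load_avg: float) -> bool:
    """Outside office hours always run; inside, require an idle, lightly loaded PC."""
    if not in_office_hours(now):
        return True
    return idle_seconds >= MIN_IDLE_SECONDS and load_avg <= MAX_LOAD
```

For example, `may_start_job(datetime(2010, 3, 1, 10, 0), 600, 0.1)` is true (a weekday morning, ten minutes idle, low load), while the same call with only 60 seconds of inactivity is false.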

Condor service caveats

only suitable for DOS-based applications running in batch mode

no communication between processes possible (“pleasantly parallel” applications only)

statically linked executables work best (although can cope with DLLs)

all files needed by application must be present on local disk (cannot access network drives)

no built-in check-pointing or standard output/error streaming

shorter jobs more likely to run to completion (10-20 min seems to work best)

very long running jobs can be accommodated using Condor DAGMan or user level check-pointing
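User-level check-pointing means the application itself saves its state at intervals and resumes from the saved state when restarted after an eviction. A minimal sketch of the pattern follows; the file name `checkpoint.json`, the checkpoint interval, and the toy sum-of-squares workload are all hypothetical stand-ins for a real job.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file first so an interrupted write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, every=100, path=CHECKPOINT):
    """Run (or resume) the job, saving state every `every` steps and at the end."""
    state = load_state(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step * step  # stand-in for the real computation
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]
```

For the checkpoint to be of any use after an eviction, the file also has to be carried back to the submit host and sent out again with the restarted job; in a Condor submit file this is usually requested with `when_to_transfer_output = ON_EXIT_OR_EVICT`.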

MATLAB advantages

originally developed for development of linear algebra algorithms but now contains many built-in functions geared to different disciplines divided into toolboxes

intuitive interactive environment allows rapid code development

simple but powerful file I/O: save <filename>, load <filename> (useful for check-pointing).

allows users to create their own functions stored as M-files

“standalone” applications can be built from M-files:
    can run on platforms without MATLAB installed
    do not need a licence to be able to run
    can include all toolbox functions

APIs available for FORTRAN and C codes (“MEX files”)

MATLAB disadvantages

even standalone applications can run slower than equivalent C or FORTRAN implementations.

standalone applications aren’t quite what they may seem:
    more than just an .exe (“manifest” file needed to locate run-time libraries)
    need access to MATLAB run-time libraries, usually via MATLAB Component Runtime (150 MB self-extracting .exe)
    luckily we have MATLAB pre-installed on all PCs in Condor pool (originally used a network drive)

run-time errors can be difficult to trace when MATLAB jobs are run under Condor:
    need to run under Condor on local PC
    configure with USE_VISIBLE_DESKTOP=True to see pop-up messages

Condor/MATLAB Research Applications

predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science)

modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science)

modelling of disease propagation in fish farms (Mathematical Sciences / Earth and Ocean Science)

testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics)

simulation of the infection of a bacterial cell by a virus (Mathematical Sciences)

modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy)

Avian influenza results

Power-saving at Liverpool

have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations.

original power-saving policy was to power-off machines after 30 minutes of inactivity, now hibernate them after 10 minutes of inactivity

policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.

makes extensive use of PowerMAN system from Data Synergy comprising:
    service which forces machines into a low-power state and reports machine activity to Management Reporting Platform
    Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser
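The quoted savings can be cross-checked with a little arithmetic. The sketch below derives what the slide's figures imply; the per-PC power draw and the electricity unit cost it prints are inferred from those figures, not stated in the slides, and 52 weeks per year is an assumption.

```python
def implied_figures(hours_per_week, mwh_per_week,
                    annual_saving_gbp=125_000, weeks_per_year=52):
    """Return (watts per idle PC, MWh saved per year, implied pence per kWh)."""
    watts = mwh_per_week * 1_000_000 / hours_per_week       # Wh / h = W
    annual_mwh = mwh_per_week * weeks_per_year
    pence_per_kwh = annual_saving_gbp * 100 / (annual_mwh * 1000)
    return watts, annual_mwh, pence_per_kwh


# Lower and upper ends of the ranges quoted on the slide.
for hours, mwh in [(200_000, 20), (250_000, 25)]:
    watts, annual_mwh, pence = implied_figures(hours, mwh)
    print(f"{hours} h/week -> {watts:.0f} W per PC, "
          f"{annual_mwh} MWh/year, {pence:.1f} p/kWh")
```

Both ends of the range work out to about 100 W per idle PC, and the £125 000 annual figure corresponds to roughly 10 to 12 pence per kWh.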

Power-saving at Liverpool

Have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations.

Original power-saving policy was to power-off machines after 30 minutes of inactivity, now hibernate them after 15 minutes of inactivity

Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.

Makes extensive use of PowerMAN system from Data Synergy comprising:
    service which forces machines into a low-power state and reports machine activity to Management Reporting Platform
    Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser

Adapting Condor for use with power-saving PCs

Two main problems:
    how to ensure Condor jobs are not evicted by hibernating PCs
    how to wake up dormant PCs to run Condor jobs on-demand

Originally used Microsoft system service to power-down PCs after 30 min inactivity:
    runs .bat file which checks if a user is logged in and shuts machine down if not
    doesn’t detect owner of Condor job as a logged-in user
    need to check for presence of condor_exe.bat

PowerMAN service now prevents job eviction:
    can provide PowerMAN with a list of “protected programs”
    ensures that system remains active if a protected program is running
    include condor_starter process as a protected program (only present while a Condor job is running).

Adapting Condor for use with power-saving PCs

Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:
    NICs must remain powered-up during hibernation/power-off
    NICs must be capable of waking machines on receipt of a “magic packet”
    network must be able to route “magic packets”
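For reference, a Wake-on-LAN magic packet is six bytes of 0xFF followed by the target's MAC address repeated 16 times. The sketch below builds and sends one over UDP broadcast. The broadcast address and port 9 are common conventions assumed here for illustration; the slides do not say which tool or settings are used at Liverpool.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Six 0xFF bytes followed by the 6-byte MAC address repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Send the packet as a UDP broadcast (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

On a campus network the sender and the sleeping PC are usually on different subnets, so the packet would have to be addressed to the target subnet's broadcast address and the routers configured to forward it, which is the routing requirement listed above.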

cron job runs on the submit host and examines the state of the queue (condor_q) and pool (condor_status):
    if more idle jobs in queue than Unclaimed machines then need to wake up hibernating machines
    find number of powered up machines in each “teaching centre” (classroom)
    estimate the number of hibernating machines in each teaching centre from total number of machines in each
    sort centres from highest number of available machines to lowest
    wake up centres in turn until sufficient machines woken to meet the demand (or all centres woken up)
    MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
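The selection steps above can be sketched as a short function. This is an interpretation rather than the actual cron script: it assumes “available machines” means the estimated number of hibernating machines in a centre, and the function name and input layout are hypothetical.

```python
def plan_wakeups(idle_jobs, unclaimed, centres):
    """Choose which teaching centres to wake.

    centres maps centre name -> (total_machines, powered_up_machines).
    Returns centre names in the order they should be woken.
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []  # enough Unclaimed machines already; nothing to wake

    # Estimate hibernating machines as total minus those currently powered up.
    hibernating = {name: max(total - up, 0)
                   for name, (total, up) in centres.items()}

    to_wake = []
    # Largest estimated number of hibernating machines first.
    for name in sorted(hibernating, key=hibernating.get, reverse=True):
        if shortfall <= 0 or hibernating[name] == 0:
            break
        to_wake.append(name)
        shortfall -= hibernating[name]
    return to_wake
```

With `centres = {"A": (30, 25), "B": (40, 10), "C": (20, 0)}`, 45 idle jobs and 5 Unclaimed machines, the shortfall is 40, so centre B (about 30 hibernating) is chosen first and then C (about 20). Because whole centres are woken at a time, the plan can overshoot the demand.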

Automatic wake up issues

Assumes that any job can run on any machine:
    users cannot choose particular teaching centres or machines in their job Requirements
    ideally, pool needs to be homogeneous
    errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate)
    cron now includes a “sanity check” for this

Can only estimate number of hibernating machines in each centre

May wake up more machines than needed

Automatic wake up in action – Condor pool machine statistics

Automatic wake up in action – PowerMAN statistics

Recent and Future Developments

starting to make use of automatic wake-up features of Condor 7.4.1 (condor_rooster)

    cron advertises/updates ClassAds for offline machines
    Condor matches offline machines to jobs and wakes up machines as needed
    use slow ramp-up of wake-ups to prevent server “overload”

users can now specify memory requirements, processor speed, when to run jobs etc

local tools available to assist in the preparation and running of MATLAB jobs: m_file_submit, matlab_build, matlab_submit
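The slow ramp-up of wake-ups mentioned above amounts to waking machines in small batches with a pause between batches, so that many PCs do not resume and contact the server at once. A minimal sketch of that idea follows; the batch size and pause length are hypothetical, and this is not how condor_rooster itself is implemented or configured.

```python
import time


def ramp_up(machines, batch_size=10, pause_seconds=60,
            wake=print, sleep=time.sleep):
    """Wake machines in batches of batch_size, pausing between batches.

    wake is called once per machine (for example, to send a magic packet);
    sleep is injectable so the pacing can be tested without waiting.
    """
    for start in range(0, len(machines), batch_size):
        for machine in machines[start:start + batch_size]:
            wake(machine)
        # Pause only if another batch is still to come.
        if start + batch_size < len(machines):
            sleep(pause_seconds)
```

Waking 25 machines with the defaults would send three batches (10, 10 and 5) with two pauses in between.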