Building a secure Condor® pool in an open academic environment
Bruce Beckles, University of Cambridge Computing Service
Condor pool characteristics
• Large number (~1000) of similar/identical workstations
• Workstations centrally managed
• Primary purpose of workstations not for running Condor jobs
• Workstations are “public access” machines, i.e. available to all members of the institution
Fundamental requirements
• Condor service in this environment must be:
  • Stable: must not make machines any less stable
  • Low impact: must be unnoticeable to ordinary users
  • Secure: must not significantly increase the attack surface
Stability
• Only use the current Condor stable series, not the development series
• Extensive testing (months, 1000s of test jobs) on small pool of workstations
• Disable any features of Condor not required by users
• Support only limited subset of Condor functionality (only Vanilla and Java universes)
Low impact
• Gather usage statistics of target workstations and only allow Condor to run at periods when they would normally be idle
• Will not run jobs if a user is logged in
  • Custom ClassAd attribute with number of users logged in
• Any user activity aggressively preempts Condor job
  • Issue under standard Linux 2.6 kernels: USB mouse and keyboard activity not detected
• Control Condor job’s environment and sterilise environment after job completion
  • Handles jobs using up all available disk space and not cleaning up after themselves, etc.
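A minimal sketch of how the "number of users logged in" value could be computed and applied, assuming the input is the text output of the standard UNIX `who` command (first column is the user name). The attribute name `LoggedInUsers` and all function names are hypothetical; the slides do not say how the attribute was actually produced.

```python
def count_logged_in_users(who_output: str) -> int:
    """Count distinct user names in `who`-style output (one session per line)."""
    users = set()
    for line in who_output.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            users.add(fields[0])
    return len(users)


def classad_line(who_output: str, attribute: str = "LoggedInUsers") -> str:
    """Format the count as an 'Attribute = value' line.

    The attribute name is an assumption made for this example.
    """
    return f"{attribute} = {count_logged_in_users(who_output)}"


def may_start_job(who_output: str) -> bool:
    """Policy from the slide: never start a job while any user is logged in."""
    return count_logged_in_users(who_output) == 0
```

Counting distinct names rather than sessions means one user with several terminals is counted once; either count would serve the "no jobs while anyone is logged in" policy.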
Security
• What is our threat landscape? What are we worried about?
• How does this specifically relate to Condor?
  • Specific security concerns…
  • …and how we addressed them
Threat landscape
• Threats internal to the environment are at least as significant as external threats:
  • Largest body of users (students) are untrusted
  • No clear separation of use of machines by trusted and untrusted users
• Access (often wholly or largely unrestricted) to the public Internet is a core requirement:
  • Both for normal use of the machines and for Condor jobs
  • Firewalls are of little help
Specific security concerns (1)
• Reliable identification of machines:
  • IP addresses useless as identifiers (IP “spoofing”)
  • So “strong” authentication required
• Do not significantly increase the attack surface of machines:
  • No daemons running as root that listen to the network:
    • Privilege separation (see following talk)
• Control access to the Condor pool:
  • Easiest at point of job submission
  • Restricted number of centralised submit nodes
Specific security concerns (2)
• Controlling the job execution environment:
  • Inspect job prior to running on machine
  • Start job in a sterile environment
  • Sterilise environment after job has run
  • Job run under dedicated unprivileged user account
• Restrict access to the Condor commands:
  • Ideally develop separate front-end to Condor system
  • Currently just wrapper scripts for Condor commands
  • Can be circumvented (in some cases), so piloting service with relatively trusted users
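The "start sterile, sterilise afterwards" steps above can be sketched as follows. This is an illustration under stated assumptions, not the mechanism used in the pool described here: the function name, the choice of environment variables, and the use of a temporary scratch directory are all hypothetical, and switching to a dedicated unprivileged account is deliberately left out because it requires root.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_in_sterile_environment(command, timeout=60):
    """Run `command` with a minimal environment inside a throwaway directory.

    Returns (exit_code, scratch_path). The scratch directory is removed
    before returning, even if the job fails or times out.
    """
    scratch = tempfile.mkdtemp(prefix="job-")
    # Assumption: the job sees only these variables, nothing inherited.
    clean_env = {"PATH": os.defpath, "HOME": scratch, "TMPDIR": scratch}
    try:
        result = subprocess.run(command, cwd=scratch, env=clean_env, timeout=timeout)
        return result.returncode, scratch
    finally:
        # Sterilise: delete everything the job left behind.
        shutil.rmtree(scratch, ignore_errors=True)


if __name__ == "__main__":
    # Example job that leaves a file behind in its working directory.
    job = [sys.executable, "-c", "open('leftover.txt', 'w').write('x')"]
    code, path = run_in_sterile_environment(job)
    print(code, os.path.exists(path))
```

Cleaning up in a `finally` clause means a crashed or timed-out job is treated the same as a successful one, which matches the intent of not trusting jobs to clean up after themselves.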
Strong authentication
• Currently only available under UNIX/Linux
• Kerberos or GSI
• GSI:
  • Flawed security paradigm (mandates daemons run as root, etc.)
  • Serious usability and scalability issues
• Kerberos:
  • KDCs provide separate audit trail
  • Plan to use Kerberos elsewhere in the University
  • Support for Kerberos under Windows and MacOS X is being added to Condor; support for GSI is not (functional GSI libraries not available)
  • Bug in Kerberos support in the stable series of Condor:
    • Backported patch from development series to fix
• Kerberos has proved surprisingly easy to deploy and administer in our setup
Scalability / Performance
• condor_schedd (job queue management) doesn’t scale well:
  • “Monolithic” process: performs too many different tasks
  • Uses blocking connections in stable series
  • In our experience:
    • Performs very badly above 4,000 jobs
    • Falls over above 10,000 jobs
    • Cannot handle significant numbers of short-running (less than 5 minute) jobs
    • Job overhead is such that jobs need to be about 10 minutes long to be worth running under Condor
• Not much we can do about this:
  • Add more submit nodes as demand on our service rises
  • Educate our users to use service sensibly (e.g. “batch up” short running jobs)
  • Wrap / replace Condor commands to encourage sensible behaviour / mitigate some of these problems
  • Lobby Condor Team to re-design the condor_schedd daemon
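As an illustration of "batching up" short-running jobs, the sketch below groups tasks so that each submitted job runs for at least a target duration. The 600-second default reflects the "about 10 minutes" figure above; the function name and the assumption that per-task runtimes are known in advance are hypothetical.

```python
def batch_short_jobs(task_runtimes, target_seconds=600):
    """Group tasks, in order, into batches lasting at least `target_seconds`.

    `task_runtimes` is a list of estimated runtimes in seconds. Returns a
    list of batches, each a list of task indices. Leftover tasks that cannot
    fill a batch on their own are merged into the previous batch.
    """
    batches, current, total = [], [], 0.0
    for index, runtime in enumerate(task_runtimes):
        current.append(index)
        total += runtime
        if total >= target_seconds:
            batches.append(current)
            current, total = [], 0.0
    if current:
        if batches:
            batches[-1].extend(current)  # avoid a short trailing job
        else:
            batches.append(current)      # everything fits in one short job
    return batches
```

With twelve 2-minute tasks this yields two jobs (five tasks and seven tasks) instead of twelve, so the schedd handles a sixth of the queue entries.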
Partitioning the pool
• Require ability to only allow jobs from certain users to run on certain machines:
  • No sensible way provided to do this
  • Restriction via lists of users or machines in configuration files / ClassAd attributes is unwieldy and doesn’t scale
• Our method:
  • Machines configured to only accept jobs with particular ClassAd attribute
  • Set automatically by our wrapper scripts based on user’s identity
  • On execute nodes cross check user against independently maintained and distributed (via LDAP) ACL; this prevents users falsifying the ClassAd attributes
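The two-sided check described above can be sketched as follows, with job ClassAds modelled as plain dictionaries. `PoolPartition` is a hypothetical attribute name, the ACL is a plain in-memory mapping standing in for the LDAP-distributed one, and both function names are invented for this example.

```python
def tag_job(job_ad, user, user_partitions):
    """Submit side (wrapper script): set the partition from the user's identity."""
    partition = user_partitions.get(user)
    if partition is None:
        raise PermissionError(f"user {user!r} may not submit jobs")
    tagged = dict(job_ad)
    tagged["Owner"] = user
    tagged["PoolPartition"] = partition  # hypothetical attribute name
    return tagged


def machine_accepts(job_ad, machine_partition, acl):
    """Execute side: require the right attribute AND an independent ACL match."""
    if job_ad.get("PoolPartition") != machine_partition:
        return False
    # The ACL is maintained separately, so a forged attribute is not enough.
    return job_ad.get("Owner") in acl.get(machine_partition, set())


if __name__ == "__main__":
    partitions = {"alice": "chemistry"}
    acl = {"chemistry": {"alice"}}
    honest = tag_job({"Cmd": "/bin/true"}, "alice", partitions)
    forged = {"Cmd": "/bin/true", "Owner": "mallory", "PoolPartition": "chemistry"}
    print(machine_accepts(honest, "chemistry", acl))  # True
    print(machine_accepts(forged, "chemistry", acl))  # False
```

The second check is what makes the scheme robust when the wrapper scripts are bypassed: the attribute alone only expresses a claim, while the ACL decides whether the claim is honoured.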
Architectural overview
• Large number of centrally managed “public access” workstations running Linux
• Jobs only run when no users are logged in
• Centralised submit node(s)
• Wrappers around Condor commands
• Restricted (but still useful) subset of Condor’s functionality
• Machine identity strongly authenticated
• Improved Condor security model:
  • Privilege separation on execute nodes
  • Strict control of job environment
Conclusion
• Although Condor not designed for a “hostile” environment, it can be used relatively securely in such environments (some caveats naturally)…
• …under Linux…
• …but a lot of development work is required to achieve this…
• …and it requires the supporting infrastructure of a stable, centrally managed workstation service.
• Improvements to Condor would make this significantly easier:
  • Design for a hostile environment. These days, most environments are.