Resource management system for distributed environment
B4. Nguyen Tuan Duc
Background
Emerging need for resource management systems for clusters and grids
Several systems exist, but they have problems…
  Portable Batch System
  Sun Grid Engine
  ….
Goal
Flexible resource management system
  Support clusters, grids
  Fair-share scheduling
  Maximize utilization of resources
  Support parallel applications
  Reduce load aggregation
Agenda
Background
Goal
Related work
Proposed method
Problems
Related work
Portable Batch System (MRJ, 1990s)
  Batch queuing system
  Automatic load balancing
  Parallel job support
  Job accounting
Portable Batch System (PBS)
Sun Grid Engine
Batch queuing system by Sun Microsystems
Same features as PBS, plus
  Job checkpointing
  Several add-ons
Problems of batch queuing systems
Resource utilization
Load aggregation
  Server accepts too many requests from clients
Limited execution model
  Cannot use fork(), since a process created with fork() does not go into the queue
…
Saito Dai’s system (STDS)
Flexible Resource Management System for Widely Distributed Environment (2006)
  No load aggregation
  Job scheduling on each node
  Independent of the execution model (fork, … OK)
  Supports parallel jobs
STDS structure
Two main components
  Node searching system (graph searching)
  Scheduler (on each node)
Scheduler
  Daemon on each node
  CPU fair-sharing by ‘nice’
Node searching system
  Creates a graph from links
  Node search is done by graph search
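To make "node search is graph search" concrete, here is a minimal sketch. The slides do not say which search algorithm STDS uses, so breadth-first search is an assumption, and the names `links`, `is_suitable` and `find_nodes` are hypothetical.

```python
from collections import deque


def find_nodes(links, start, is_suitable, wanted):
    """Breadth-first search over the link graph, starting at `start`.

    Collects up to `wanted` nodes for which `is_suitable(node)` is true.
    `links` maps each node name to the nodes it is linked to.
    """
    seen = {start}
    queue = deque([start])
    found = []
    while queue and len(found) < wanted:
        node = queue.popleft()
        if is_suitable(node):
            found.append(node)
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return found


# Hypothetical four-node graph; nodes "b" and "d" are idle.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
idle = {"b", "d"}
result = find_nodes(links, "a", lambda n: n in idle, wanted=2)
print(result)
```

The search stops as soon as enough suitable nodes are found, so nearby nodes are preferred over distant ones.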
STD node searching system
Our approach
Similar to STD system
  Node searching system
  Scheduler on each node
But different in …
  Node search: no graph searching
  Scheduler: kernel scheduler with user accounting (budget scheduler)
Scheduler: Budget scheduling
Budget scheduling
  Normal queue & budget queue
  Normal queue for interactive processes
    Linux 2.6 default scheduler
  Budget queue for CPU-hogging processes
    Automatic detection of CPU-intensive processes
http://www.logos.ic.i.u-tokyo.ac.jp/~duc/pre/1107.ppt
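The two-queue split can be sketched in user space as follows. This is only an illustration: the real budget scheduler lives in the kernel, and the slides do not give the detection rule. The threshold `CPU_HOG_THRESHOLD`, the sampling window, and the names `Proc`, `is_cpu_intensive` and `assign_queues` are all assumptions.

```python
from dataclasses import dataclass

# Assumed rule: a process that spent at least this fraction of the
# sampling window on the CPU counts as CPU-intensive.
CPU_HOG_THRESHOLD = 0.8


@dataclass
class Proc:
    pid: int
    cpu_seconds: float     # CPU time used during the last window
    window_seconds: float  # length of the sampling window


def is_cpu_intensive(p, threshold=CPU_HOG_THRESHOLD):
    """Return True if the process used at least `threshold` of the window."""
    return p.window_seconds > 0 and p.cpu_seconds / p.window_seconds >= threshold


def assign_queues(procs):
    """Split processes into (normal queue, budget queue) lists of pids."""
    normal, budget = [], []
    for p in procs:
        (budget if is_cpu_intensive(p) else normal).append(p.pid)
    return normal, budget


procs = [Proc(1, 0.05, 1.0), Proc(2, 0.97, 1.0), Proc(3, 0.40, 1.0)]
normal_queue, budget_queue = assign_queues(procs)
print(normal_queue, budget_queue)
```

Interactive processes stay in the normal queue and keep the default scheduler's behaviour; only processes detected as CPU-intensive move to the budget queue.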
Node searching system
Client-server model
  Daemon on each node
  Daemon reports CPU state (number of processes, CPU utilization, …) directly to the user
  Reports maximum price
From where can users submit jobs?
  From everywhere on the cluster or grid
  From their desktop, via the Internet
  Need for a job submission system
Node searching system (NSS)
User
Who will determine nodes?
User!
Users choose nodes appropriate for their jobs
  Parallel jobs: idle CPUs or CPUs with low-price jobs
  Long-lasting jobs: idle CPU, set low price
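The selection rule for parallel jobs (idle CPUs first, then CPUs running low-price jobs) could look like this on the user's side. It is a rough sketch: the idle cutoff `idle_below`, the reading of "maximum price" as the highest price among jobs on a node, and the names `NodeInfo` and `choose_nodes` are assumptions, not part of the system described.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    host: str
    cpu_utilization: float  # 0.0 to 1.0
    max_price: float        # assumed: highest price among jobs on the node


def choose_nodes(nodes, count, idle_below=0.1):
    """Pick `count` hosts: idle CPUs first, then busy CPUs by ascending price."""
    idle = [n for n in nodes if n.cpu_utilization < idle_below]
    busy = sorted(
        (n for n in nodes if n.cpu_utilization >= idle_below),
        key=lambda n: n.max_price,
    )
    return [n.host for n in (idle + busy)[:count]]


nodes = [
    NodeInfo("n1", 0.95, 5.0),
    NodeInfo("n2", 0.02, 0.0),
    NodeInfo("n3", 0.90, 1.0),
]
chosen = choose_nodes(nodes, 2)
print(chosen)
```

Because the choice runs on the user's machine, each user can swap in a different rule, for example idle CPUs only for long-lasting jobs.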
Node searching system (NSS)
NSS should report to users:
  CPU utilization
  Maximum price
  Load (number of processes, ..)
  …
Daemon on each node sends information about the node to the client.
  Client is on the user’s machine
  No heavy load aggregation
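One possible shape for the per-node report that a daemon sends to the client is sketched below. The slides list the fields but not a wire format, so JSON, the field names, and the helpers `encode_report` and `decode_report` are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class NodeReport:
    host: str
    cpu_utilization: float  # 0.0 to 1.0
    max_price: float
    process_count: int      # the "load" figure from the slide


def encode_report(report):
    """Daemon side: serialize a report to bytes for sending to the client."""
    return json.dumps(asdict(report)).encode("utf-8")


def decode_report(payload):
    """Client side: rebuild the report from the received bytes."""
    return NodeReport(**json.loads(payload.decode("utf-8")))


sent = NodeReport("node01", 0.35, 2.5, 12)
received = decode_report(encode_report(sent))
print(received)
```

Each report is small and goes straight from a node to the user's client, which is how this design avoids aggregating load on a central server.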
Problems!!!
May be heavy load on the user's client
NAT, firewall
  How can the client connect to the server??
What information is needed?
  Only CPU utilization, maximum price, load, average price?