Processor allocation - Philadelphia University

Processor allocation

System performance


Queueing systems are useful because it is possible to model them (input and service processes) analytically.

Let us call the total input rate, from all the users combined, λ requests per second. Let us call the rate at which the server can process requests µ.

For stable operation, we must have:

µ > λ.

λ = arrival rate = N′ / T, where N′ is the number of arrivals observed during a period T.

Av. interarrival time = 1 / λ

• The arrival rate λ is the number of arrivals per unit of time.

• The interarrival time (1/λ) is the time between one arrival into the system and the next.

- Exercise: prove that the interarrival time equals 1/λ.

• Let the average service rate µ be the number of processes that can be served per unit of time:

µ = N / Σ(i=1..N) Si   (Av. service rate, processes per time unit)

where Si is the service time of process i and N is the number of processes served; 1/µ is the Av. service time (sec/process).

• Utilization: ρ = λ/µ = (N′/T) × (Σ(i=1..N) Si / N)


Example: For stable operation we must have µ > λ. If the server can handle 100 requests/sec but the users continuously generate 110 requests/sec, the queue will grow without bound and the system is unstable.

The mean response time (RT) and the mean turnaround time (TT):

The mean time between issuing a request and getting the response's first output is the mean response time RT, while the mean of the total time from process submission to process completion is the mean turnaround time TT.

M/M/1: isolated workstation (WS)

M/M/1 means that the system assumes an exponential (Markovian) distribution for both the interarrival times (arrival rate λ) and the service times (service rate µ).

The mean response time of the WS:  RT1 = 1 / (µ − λ)

The mean turnaround time of the WS:  TT1 = 1 / (µ − λ)   // proof


Prove why RT = TT

For the M/M/1 workstation:

- Utilization ρ = λ/µ is the fraction of time the CPU is busy; (1 − ρ) is the fraction of time the CPU is idle.
- The mean number of processes an arrival finds in the system is N = ρ / (1 − ρ) = λ / (µ − λ).
- The mean waiting time in the queue is that number times the average service time:
  Wq = N × (1/µ) = λ / (µ(µ − λ)).
- The response time is the waiting time plus one service time:
  RT1 = Wq + 1/µ = λ / (µ(µ − λ)) + 1/µ = (λ + (µ − λ)) / (µ(µ − λ)) = µ / (µ(µ − λ)) = 1 / (µ − λ).
- The turnaround time is likewise TT1 = Wq + service time = Wq + 1/µ = 1 / (µ − λ).

Hence RT1 = TT1 = 1/(µ − λ) in M/M/1: a request produces its output only when it completes, so response and turnaround coincide.


Utilization of one WS

Utilization (traffic intensity):
ρ = λ/µ = Av. service time × arrival rate
  = (# of instructions / speed of server) × (N′/T)
  = (Σ(i=1..N) Si / N) × (N′/T)

Stability condition: ρ < 1 (the server S is fast); if ρ > 1 the server S is slow (unstable); ρ << 1 → optimality.

Example: consider a file server that is capable of handling as many as 50 requests/sec (µ) but which only gets 40 requests/sec (λ). The mean response time at a WS with one processor will be 1/10 sec, or 100 msec:

RT1 = 1/(µ − λ) = 1/(50 − 40) = 1/10 sec = 100 msec

Note: when λ goes to 0 (no load), the response time of the file server does not go to 0, but to 1/50 sec, or 20 msec:

RT1 = 1/(µ − λ) = 1/(50 − 0) = 1/50 sec = 20 msec

Problem / The above example is obvious with respect to the mathematical formula. If the server can process 50 requests/sec, it must take 20 msec to process a single request, so the response time, which includes the processing time, can never go below 20 msec. // What is the logical reason for this value of 20 msec? …discuss this.

// With no load, the server serves the one request completely, without context switching, so it needs only 20 msec; while the system is under load (λ > 0) it also performs switching between requests, so the response time rises toward 100 msec (and stays below it while the queue is stable).

// Why? µ >> λ = N′/T > 0 means N′ > 0; if λ = 0, then N′ = 0 or T → ∞ (no arrivals).


2- Using the processor pool model (multiprocessor system), M/M/n

Now consider what happens if we scoop up all the CPUs and place them in a single processor pool. Instead of having n small queueing systems running in parallel, we now have one large one, with an input rate nλ and a service rate nµ. // (service time 1/(nµ)).

Let us call the mean response time of this combined system RTn. From the formula above we find

M/M/n processor pool model

The mean response time of the pool:  RTn = 1/(nµ − nλ) = (1/n) × 1/(µ − λ) = RT1 / n

The mean turnaround time of the pool:  TTn = 1/(nµ − nλ)   // Proof:

TTn = Wq + service time = Wq + 1/(nµ)
RTn = Wq + 1/(nµ)

Hence TTn = RTn = 1/(nµ − nλ).
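The n-fold claim can be checked with a quick numeric sketch (Python; the µ and λ values are assumed for illustration, reusing the earlier file-server numbers):

```python
def rt_isolated(lam, mu):
    """Mean response time of one M/M/1 workstation: RT1 = 1/(mu - lam)."""
    return 1 / (mu - lam)

def rt_pool(n, lam, mu):
    """Mean response time of the combined pool, modelled as one fast server
    with input rate n*lam and service rate n*mu:
    RTn = 1/(n*mu - n*lam) = RT1 / n."""
    return 1 / (n * mu - n * lam)

rt1 = rt_isolated(40, 50)     # one workstation: 0.1 sec
rt16 = rt_pool(16, 40, 50)    # 16-processor pool: 0.1/16 = 0.00625 sec
```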


Utilization of n processors

Utilization (traffic intensity):
ρn = λ/(nµ) = Av. service time × arrival rate × (1/n)
   = (# of instructions / speed of server) × (N′/T) × (1/n)
   = (Σ(i=1..N) Si / N) × (N′/T) × (1/n)

Stability condition: ρ < 1 (the server S is fast); if ρ > 1 the server S is slow; optimal state when ρ << 1.

This surprising result says that by replacing n small resources with one big one that is n times more powerful, we can reduce the average response time n-fold. This result is extremely general and applies to a large variety of systems. // it needs proof.

Example: it is one of the main reasons that airlines prefer to fly a 300-seat 747 once every 5 hours rather than a 10-seat business jet every 10 minutes. // Why? Because dividing the processing power into small servers (e.g., personal workstations), each with one user, is a poor match to a workload of randomly arriving requests.

Proof

For the case of 10 seats every 10 min:

Let the speed of service S = number of seats (N) / Time = 10/10 = 1.
Let the number of servers be N = 10 and the arrival rate λ′ = 10, with µ = N × S = 10.   // Why T = 10?

RT10 = 1/(Nµ − λ′) = 1/(10 × 10 − 10) = 1/(100 − 10) = 1/90


For the case of 300 seats every 5 hours (300 min):

Let the speed of service S = number of seats (N) / Time = 300/(5 × 60) = 1.
Let the number of servers be N = 300 and the arrival rate λ′ = 300, with µ = N × S = 300.   // Why T = 300?

RT300 = 1/(Nµ − λ′) = 1/(300 × 300 − 300) = 1/(90000 − 300) = 1/89700

Since 1/89700 < 1/90, we prefer flying with 300 seats.

Why use the isolated workstation model if the pool model is better?

• Given a choice between one centralized 1000-MIPS CPU and 100 private dedicated 10-MIPS CPUs, the mean response time of the one-CPU system will be 100 times better, because no cycles are ever wasted. // why?

• However, mean response time is not everything. There is also cost. If a single 1000-MIPS CPU is much more expensive than 100 10-MIPS CPUs, the price/performance ratio of the workstations may be much better.

• Reliability and fault tolerance are also factors. // a problem of the single queue of the pool.

• So far we have tacitly assumed that a pool of n processors is effectively the same thing as a single processor that is n times as fast as a single processor. In reality, this assumption is justified only if all requests can be split up in such a way as to allow them to run on all the processors in parallel. If a job can be split into, say, only 5 parts, then the processor pool model has an effective service time only 5 times better than that of a single processor, not n times better.


A Hybrid Model

A possible compromise is to provide each user with a personal workstation and to have a processor pool in addition. Although this solution is more expensive than either a pure workstation model or a pure processor pool model, it combines the advantages of both of the others.


Speedup performance measures

Computational speedup is a function of the algorithm design and of the efficiency of the scheduling algorithm that maps the algorithm onto the system architecture.

Model the speedup factor S, which is a function of the parallel algorithm, the system architecture, and the scheduling of execution:

S = F(Algorithm of user app., System Arch., Scheduling algorithm)

Let us define the following time aspects:

OSPT- optimal sequential processing time : Is the best time that can be achieved on single processor using best sequential algorithm.

CPT- concurrent processing time : Is the actual time achieved on an n-processors system with the concurrent (parallel ) algorithm and specific scheduling algorithm.

OCPTideal – optimal concurrent processing time at ideal system: Is the best time that can be achieved with the concurrent (parallel) algorithm on ideal n-processors system

// (no IPC overhead and using optimal scheduling algorithm).

Si – The ideal speedup obtained by using n-processors system over the best sequential time.


Sd – the degradation of the system due to actual implementation compared to an ideal system.

Speedup S = (optimal sequential processing time) / (Parallel or concurrent Execution Time)

Let (RC) be the Relative Concurrency: it measures how far from optimal the usage of the n processors is. In other words, how well adapted the given problem and its algorithm are to the ideal n-processor system.

Let (RP) be the Relative Processing: the ratio of the total computation time needed for the parallel algorithm to the time for the optimal sequential algorithm.

Si = RC × (1/RP) × n
   = [Σ(i=1..m) Pi / (n × OCPTideal)] × [OSPT / Σ(i=1..m) Pi] × n
   = OSPT / OCPTideal

where m is the number of tasks in the algorithm, n is the number of processors, and Σ(i=1..m) Pi is the total computation of the concurrent algorithm.

Sd = 1 / (1 + Eff)

where Eff = (CPT − OCPTideal) / OCPTideal


Then Sd = 1 / (1 + (CPT − OCPTideal)/OCPTideal) = OCPTideal / CPT

Eff is the efficiency loss: the ratio of the real system overhead, due to all causes, to the ideal optimal processing time.

Eff = (CPT − OCPTideal) / OCPTideal

Eff = Effsched + Effsystem

the efficiency losses due to the scheduling algorithm and the system architecture, respectively.

//Discuss why optimal Eff is hard to reach?
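The definitions above fit together consistently, which a small sketch can confirm (Python; the helper name and the sample numbers are ours, not from the notes):

```python
def speedup_measures(ospt, cpt, ocpt_ideal):
    """Combine the notes' definitions:
    Si  = OSPT / OCPT_ideal              (ideal speedup)
    Eff = (CPT - OCPT_ideal)/OCPT_ideal  (efficiency loss)
    Sd  = 1 / (1 + Eff) = OCPT_ideal/CPT (degradation)
    S   = Si * Sd = OSPT / CPT           (actual speedup)"""
    si = ospt / ocpt_ideal
    eff = (cpt - ocpt_ideal) / ocpt_ideal
    sd = 1 / (1 + eff)
    return si, eff, sd, si * sd

# Hypothetical numbers: OSPT = 100, CPT = 25, OCPT_ideal = 20.
si, eff, sd, s = speedup_measures(100, 25, 20)   # 5.0, 0.25, 0.8, 4.0
```

Note that S = Si × Sd collapses to OSPT / CPT, matching the definition of speedup given above.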

Amdahl’s Law

Speedup =S(n) = ( Serial Execution Time ) / (Parallel Execution Time )

= Ts / Tp

• Let f = fraction of computation (algorithm) that is serial and cannot be parallelized due to:


– Data setup
– Reading/writing to a single disk file (non-sharable resource)

Ts = serial portion + parallel portion
Ts = f × Ts + (1 − f) × Ts
Tp = f × Ts + ((1 − f) × Ts) / n

• Speedup = S(n) = Ts / Tp = Ts / (f × Ts + ((1 − f) × Ts) / n)

• Speedup = S(n) = n / ((n − 1) × f + 1)   (Amdahl's law)

Proof: // HW

S(n) = Ts / Tp,  where Tp = f × Ts + ((1 − f) × Ts) / n

S(n) = Ts / (f × Ts + ((1 − f) × Ts) / n)

Multiply numerator and denominator by n:

     = (n × Ts) / (n × f × Ts + (1 − f) × Ts)
     = (n × Ts) / (Ts × (n × f + 1 − f))
     = n / ((n − 1) × f + 1)   (Amdahl's law)

Proof that the maximum speedup equals 1/f:

S(n) = n / ((n − 1) × f + 1)   (Amdahl's law)

For max S(n): if n >> 1, then n − 1 ≈ n and (n − 1) × f + 1 ≈ n × f, so

S(n) ≈ n / (n × f) = 1/f = max speedup


Example using Amdahl’s Law

• Suppose that a calculation has a 4% serial portion; what is the limit of speedup on 16 processors? What is the maximum speedup?

Ans:

Speedup S(16) = n / ((n − 1) × f + 1) = 16 / ((16 − 1) × 0.04 + 1) = 16 / 1.6 = 10

Max speedup = 1/f = 1/0.04 = 25
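The worked example can be reproduced with a one-line helper (a sketch we add for checking, not part of the notes):

```python
def amdahl_speedup(n, f):
    """Amdahl's law: S(n) = n / ((n - 1) * f + 1)."""
    return n / ((n - 1) * f + 1)

s16 = amdahl_speedup(16, 0.04)   # 16 / (15*0.04 + 1) = 16 / 1.6 = 10
max_speedup = 1 / 0.04           # limit as n grows: 1/f = 25
```

Evaluating the helper at very large n (say n = 10^9) approaches 25, illustrating the 1/f limit.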

PROCESSOR ALLOCATION

• By definition , a distributed system consists of multiple processors. These may be organized as a collection of personal workstations , or a public processor pool , or some hybrid form . In all cases, some algorithms are needed for process or processor allocation.

• For the workstation model (when & when), the questions are: When should a process be run locally (on a specific workstation)? // policy of process allocation. When should we look for an idle workstation?

//(a process is placed on a workstation, and when there is no room for a new process on a specific WS, the distributed OS checks for another WS that is idle.)

• For the processor pool model ( which & which ), the question is : Which process is assigned to which processor ?

//a decision must be made for every new process

• We will follow tradition and refer to this subject as "processor allocation " rather than "process allocation ," // What is the difference between them?


Allocation Models

Before looking at specific algorithms, or even at design principles, it is worth saying something about the basic model, assumptions, and goals of the work on processor allocation.

Processor allocation strategies can be divided into two broad classes .

The first class is static allocation, also called nonmigratory allocation algorithms. When a process is created:

- A decision is made about where to put it.
- Once placed on a machine, the process stays there until it terminates.
- It may not move, no matter how badly overloaded its machine becomes and no matter how many other machines are idle.

The second class is dynamic allocation, also called migratory allocation algorithms. Here:

- A process can be moved even if it has already started execution.
- Migratory strategies allow better load balancing.
- They are more complex and have a major impact on system design.

Implicit in an algorithm that assigns processes to processors is that we are trying to optimize something.


1- Maximizing CPU utilization

// maximize the number of cpu cycles actually executed on behalf of user jobs per hour of real time. Maximizing CPU utilization is another way of saying that CPU idle time is to be avoided at all costs. When in doubt, make sure that every CPU has something to do.

2- Maximizing the throughput.

3- Minimizing the waiting time WT.

4- Minimizing the turnaround time TT.

5- Minimizing the mean response time RT.

Example: Two processors in the pool model with two processes, as in the following Fig.

Fig. Response times of two processes on two processors

• Processor 1 runs at 10 MIPS.
• Processor 2 runs at 100 MIPS, but has a waiting list of backlogged (accumulated) processes that will take 5 sec to finish off.
• Process A has 100 million instructions.
• Process B has 300 million instructions.

The response times for each process on each processor (including the wait time) are shown in the figure.

The question is : Which process is assigned to which processor ? Draw your conclusion.



Clearly, the first allocation is a better assignment in terms of minimizing mean response time. // What do we conclude?
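The figure's numbers can be recomputed with a short sketch (Python; this is our illustration, under the simplifying assumption that a job waits out processor 2's 5-sec backlog and then runs alone):

```python
MIPS = {1: 10, 2: 100}        # processor speeds (MIPS)
BACKLOG = {1: 0, 2: 5}        # seconds of already-queued work
JOBS = {"A": 100, "B": 300}   # millions of instructions

def response_time(job, cpu):
    """Wait for the backlog, then execute: backlog + instructions/speed."""
    return BACKLOG[cpu] + JOBS[job] / MIPS[cpu]

# The two one-job-per-processor assignments:
mean_a1_b2 = (response_time("A", 1) + response_time("B", 2)) / 2  # (10+8)/2 = 9
mean_a2_b1 = (response_time("A", 2) + response_time("B", 1)) / 2  # (6+30)/2 = 18
```

Assigning A to the slow processor 1 and B to the fast processor 2 gives the smaller mean response time (9 sec vs. 18 sec), even though each process individually would finish sooner on processor 2.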

Response ratio metric:

Another metric, called the response ratio, is defined as the amount of time it takes to run a process on the selected processor (SP) divided by how long it would take on some unloaded benchmark processor (BMP):

Response Ratio (RR) = (amount of time on SP) / (time on BMP)


Example: Let a 1-sec job take 5 sec, and let a 1-min job take 70 sec. Using response time, the former (the 1-sec job) is better; but using response ratio, the latter (the 1-min job) is much better, because 5/1 >> 70/60.

Proof: Let the selected processor be SP and the benchmark processor be BMP. Then:

Response Ratio (RR) = (time on SP) / (time on BMP)

Process # | Time in BMP    | Time in SP | RR           | RT
1         | 1 sec          | 5 sec      | 5/1 = 5      | 5
2         | 60 sec (1 min) | 70 sec     | 70/60 = 1.16 | 70

// What do we conclude? HW
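The metric is simple enough to sketch directly (Python; the function name is ours):

```python
def response_ratio(time_on_sp, time_on_bmp):
    """RR = time on the selected processor / time on the unloaded benchmark."""
    return time_on_sp / time_on_bmp

rr_short = response_ratio(5, 1)    # 1-sec job:  5.0
rr_long = response_ratio(70, 60)   # 1-min job:  ~1.17
better_by_rr = rr_long < rr_short  # the 1-min job wins on RR
```

A lower RR means the selected processor slowed the job down less relative to its intrinsic length, which is why the 70-sec run of a 1-min job is judged better than the 5-sec run of a 1-sec job.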

Design Issues for Processor Allocation Algorithms

In multiprocessor systems like the pool architecture (with shared memory), processor allocation was not a serious problem when we examined system performance. // why?
// Ans: In those systems, all processors had access to the same image of the operating system and grabbed jobs from a common job queue. When a quantum expired or a process blocked, it could be restarted by any available (idle) processor.

In multicomputer systems like workstations, things get more complex. We may not be able to use shared memory segments to communicate with other processes; messages or files are used for communication among processes. The file system may look different on different machines. The overhead of dispatching a process on another system may be high compared to the burst time of the process. // high system overhead due to process transmission


A large number of processor allocation algorithms have been proposed over the years. In this section we will look at some of the key choices involved in these algorithms and point out the various trade-offs. The major decisions the designers must make can be summed up in five issues:


1- Deterministic vs. heuristic
If we know all the resource usage (computing time requirements, file requirements, communication requirements, …) of a process in advance, we can create a deterministic algorithm to find the proper machine with the proper resources. Some process data is usually unknown, however, and heuristic techniques often have to be employed. // Discuss these methods using a heuristic algorithm and a newer approach called meta-heuristic algorithms.

2- Centralized, hierarchical, or distributed

• A centralized algorithm allows a central machine to collect, in one place, all the information necessary for making scheduling decisions (placing processes), but it can also put a heavy load on that central machine.

• In a hierarchical system, we have a number of machines acting as load managers, organized in a hierarchy.

• Distributed (decentralized) algorithms are usually preferable, but some centralized algorithms have been proposed for lack of suitable decentralized alternatives.

3- Optimal vs. suboptimal
Are we trying to find the best processor allocation, or merely an acceptable one? Optimal solutions can be obtained in both centralized and decentralized systems, but they are always more expensive than suboptimal ones: they involve collecting more information and more processing.

4- Local or global (traditional approach)
When a process is about to be created, a decision has to be made whether or not it can run on the machine where it is being generated, using local information. If that machine is too busy, the new process must be transferred somewhere else, using global information. This is known as the transfer policy.

5- Location policy (traditional approach)
Does the machine send requests asking for help, or does it send requests for work to perform?


Implementation Issues for Processor Allocation Algorithms

Centralized algorithm 1 // the machine that created the process is not required to run it

Step 1: the system creates a new process;
Step 2: the centralized coordinator (centralized machine) picks a machine at random;
Step 3: the new process is sent to the machine selected in Step 2;
Step 4: if the selected machine is overloaded {
    Step 4-1: send the process off; // it waits in a queue until the coordinator finds a ready machine
    Step 4-2: go to Step 1; }
  else go to Step 5; // the machine is willing to take the process
Step 5: run the process on the selected machine; no more forwarding is permitted. // assign one process to one machine at a time
Step 6: go to Step 1 if the counter is not exceeded. // meaning: go to Step 1 if not all processes have been executed

Modified approach (proposed): select a ready WS from a queue instead of using random selection.


Centralized algorithm 2 (up-down algorithm) // the machine that created the process must run it itself if it is underloaded.

Step 1: initialize the penalty points of all systems to zero;
Step 2: the DOS of the local machine creates a process;
Step 3: if the local machine that created the process is too crowded to execute it locally, then
    Step 3-1: send a request to the centralized coordinator (centralized machine) asking for help to run the process anywhere;
    Step 3-2: the centralized coordinator asks the workstation with the lowest score (minimum penalty points);
    Step 3-3: the centralized coordinator adds 1 to the penalty points of the local machine at each clock tick;
  else
    Step 3-1: the centralized coordinator subtracts 1 from the penalty points of the local system at each clock tick while the local workstation has pending requests or is idle;
Step 4: go to Step 2 if the counter is not exceeded;

Fig. Operation of the up-down algorithm.

Hierarchical algorithm :

Centralized algorithms, such as up-down, do not scale well to large systems. The central node soon becomes a bottleneck, not to mention a single point of failure. These problems can be attacked by using a hierarchical algorithm instead of a centralized one. Hierarchical algorithms keep much of the simplicity of centralized ones, but scale better.

// needs proof


Step 1: create a new process;
Step 2: try to pick a manager machine at random;
Step 3: if the number of selected managers (machines acting as dept. heads) does not exceed the manager limit, then
    Step 3-1: select a manager at random, which tries to find a suitable machine (worker);
    Step 3-2: the manager selects a machine;
    Step 3-3: the manager asks the machine whether it is underloaded or overloaded;
    Step 3-4: if the machine is underloaded, then
        Step 3-4-1: send the new process to the selected machine;
        Step 3-4-2: run the process; no more forwarding is permitted // assign one process to one machine at a time
      else
        Step 3-4-1: go to Step 3 to find a new manager (go up the hierarchy);
  else
    Step 3-1: the process stays in the queue of the machine that created it until a new manager is found; // all managers are busy.
Step 4: go to Step 1 if the counter is not exceeded;

Distributed algorithm 1
Step 1: a process is created by the system;
Step 2: contact K managers to check K machines and determine their exact loads;
Step 3: send the process to the machine with the smallest load.

Each machine works as a manager.


Distributed algorithm 2: each machine works as a manager.

Sender side (from the manager):
Step 1: a new process is created by the local system; // local system = local machine with its local OS
Step 2: if the local system needs help running the process (due to overloading), then
    Step 2-1: pick a machine at random;
    Step 2-2: send the selected machine the message "can you run my process?";
    Step 2-3: if the selected machine cannot hold the process, then
        Step 2-3-1: go to Step 2-1;
      else
        Step 2-3-1: send the process to the selected machine for running;
        Step 2-3-2: go to Step 1 if the counter is not exceeded;
  else
    Step 2-1: the local system runs the process;
    Step 2-2: go to Step 1 if the counter is not exceeded;
Step 3: stop.

Receiver side:
Step 1: if the system is not loaded, then
    Step 1-1: pick a machine at random;
    Step 1-2: send the message "I have free cycles";
    Step 1-3: if it cannot get any process, repeat the same request.

Fig. (a) A sender looking for an idle machine. (b) A receiver looking for work to do.
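The sender side can be sketched as a probing function (Python; `threshold` and `k_max` are assumed parameters we introduce for illustration, not values from the notes):

```python
import random

def sender_probe(machines, load, threshold, k_max=3, seed=None):
    """Sketch of the sender-initiated policy: an overloaded machine asks
    up to k_max randomly chosen machines "can you run my process?" and
    returns the first one whose load is below the threshold, or None,
    meaning the process runs locally."""
    rng = random.Random(seed)
    for _ in range(k_max):
        m = rng.choice(machines)
        if load[m] < threshold:   # the probed machine accepts
            return m
    return None                   # nobody accepted: run the process locally

target = sender_probe(["m1", "m2"], {"m1": 0, "m2": 0}, threshold=3, seed=1)
```

Capping the number of probes (`k_max`) keeps the overhead bounded even when the whole system is overloaded.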


A Graph-Theoretic Deterministic Algorithm

Consider a system consisting of processes with known CPU time and memory requirements, and a known matrix giving the average amount of traffic between each pair of processes. If the number of CPUs, k, is smaller than the number of processes, several processes will have to be assigned to each CPU.

The idea is to perform this assignment such as to minimize network traffic.

The system can be represented as a weighted graph, with each node being a process and each arc representing the flow of messages between two processes.

Mathematically, the problem then reduces to finding a way to partition (i.e., cut) the graph into k disjoint sub-graphs, subject to certain constraints (e.g., total CPU and memory requirements below some limits for each sub-graph). For each solution that meets the constraints, arcs that are entirely within a single sub-graph represent intra-machine communication and can be ignored. Arcs that go from one sub-graph to another represent network traffic.

In Fig. (a), we have partitioned the graph with processes A, E, and G on one processor, processes B, F, and H on a second, and processes C, D, and I on the third. The total network traffic is the sum of the arcs intersected by the dotted cut lines, or 30 units .

In Fig. (b) we have a different partitioning that has only 28 units of network traffic. Assuming that it meets all the memory and CPU constraints, this is a better choice because it requires less communication.


Naturally, what we are doing is looking for clusters that are tightly coupled (high intracluster traffic flow) but which interact little with other clusters (low intercluster traffic flow)
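The traffic of a given cut is straightforward to compute, as this sketch shows (Python; the sample graph is hypothetical, since the figure's actual edge weights are not given in the text):

```python
def network_traffic(edges, assignment):
    """Total network traffic of a partition: the sum of the weights of the
    arcs whose two endpoints are assigned to different processors.
    edges: {(u, v): weight}; assignment: {process: processor}."""
    return sum(w for (u, v), w in edges.items()
               if assignment[u] != assignment[v])

# Hypothetical 4-process graph.
edges = {("A", "B"): 3, ("B", "C"): 2, ("A", "C"): 4, ("C", "D"): 1}
cut1 = network_traffic(edges, {"A": 0, "B": 0, "C": 1, "D": 1})  # 2 + 4 = 6
cut2 = network_traffic(edges, {"A": 0, "B": 1, "C": 1, "D": 1})  # 3 + 4 = 7
```

Comparing candidate partitions this way, subject to the CPU and memory constraints, is exactly the 30-units vs. 28-units comparison made between Fig. (a) and Fig. (b).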

Another clustering approach uses an intelligent system:

A Bidding Algorithm (request algorithm) (Optimal vs. suboptimal)

Another class of algorithms tries to turn the computer system into a small economy, with buyers and sellers of services and prices set by supply and demand.

The key players in the economy are the processes , which must buy CPU time to get their work done , and processors, which auction their cycles off to the highest bidder .

Step 1: each processor advertises its approximate price by putting it in a publicly readable file. // This price is not guaranteed, but gives an indication of what the service is worth; it depends on speed, memory size, presence of floating-point hardware, and other features. Different processors may have different prices.

Step 2: the expected response time for each service provided by a processor is also published.

Step 3: when a kernel process wants to start up a user process (parent or child),
    Step 3-1: the kernel process goes around and checks out which processors offer the service it needs;
    Step 3-2: the kernel process determines the set of processors whose services it can afford. // From this set, it computes the best candidate, where "best" may mean cheapest, fastest, or best price/performance, depending on the application.


Step 3-3: the kernel process generates a bid and sends it to the first processor in the processor set. The bid may be higher or lower than the advertised price of the selected (first) processor.

Step 3-4: processors collect all the bids sent to them and make a choice, most probably by picking the highest bid. // why? To match each bid with the proper processor, corresponding to the bid's price.

Step 3-5: the winning and losing processes are informed, and the winning process is executed.

Step 3-6: the published price of the processor is then updated to reflect the new going rate.
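The auction step (Step 3-4) reduces to picking the maximum of the collected bids, as in this sketch (Python; the function name and sample bids are ours):

```python
def pick_winner(bids):
    """Step 3-4 sketch: a processor collects the bids sent to it and
    picks the highest one. bids: {process_id: offered price}."""
    return max(bids, key=bids.get)

winner = pick_winner({"p1": 4.0, "p2": 7.5, "p3": 6.0})   # "p2"
```

In a fuller implementation the winning price would then feed back into the processor's published price (Step 3-6), so prices track demand.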


Bokhari's Algorithm (Optimal vs. suboptimal)

Bokhari’s algorithm minimizes the sum of the total communication cost and the maximum execution cost in a processor (high utilization).

Assumptions

1. Let number of processes is k. 2. Let number of homogeneous processors is n which are connected in a linear

topology. 3. Let k>n. 4. Let communication links are homogeneous. 5. Let computation and communication costs are known. 6. Two processes not communicating directly will not be allocated the same

processor. 7. Two communicating processes, if not allocated same processor, must be in the

adjacent processors. 8. Let P1,…,Pn be processors in linear order and p1,…,pk be processes in linear

order. Let the execution cost of p i be w i and the communication cost between p i and p i+1 be c i.

Algorithm steps:


Step 1: find the sum of communication costs Sum1, i.e., the communication cost of process pi with its adjacent processes:

Sum1 = Ci + Ci-1

where Ci is the communication cost between pi and pi+1, and Ci-1 is the communication cost between pi-1 and pi.

Step 2: find the maximum execution cost Sum2 over the n processors:

Sum2 = max over j of ( Σi wij ),  where i = process index and j = processor index.

Step 3: find the objective function value = Sum1 + Sum2.

The objective function refers to the cost of DS based on max execution and total communication between processors.

Example: a bad processor allocation


To reduce the communication cost, this algorithm proposes the following layered graph.


• The graph has n+2 layers, where n is the number of processors and k is the number of processes.
• The top and bottom layers have 1 node each; all other layers have k nodes.
• The layers are numbered 0 through n+1, and the ith node of the jth layer is denoted v(i,j).
• The node in the top (0th) layer is adjacent to all nodes in layer 1.
• The node in the bottom layer is adjacent to all nodes in layer n.
• Other nodes v(i,j) are adjacent to v(i,j+1), …, v(k,j+1).
• Each edge e in the graph has two weights, w1(e) and w2(e) (total execution time and communication cost, respectively).
• If e is an edge connecting v(i,j) and v(r,j+1), then:

  - w1(e) is the total execution time of the processes pi to pr-1:

    w1(e) = Σ (m = i to r-1) Sm,  where Sm is the service time of process m.

  - w2(e) is the communication cost between pr-1 and pr.


Note: A shortest path from top to bottom on the basis of w2 will minimize the total communication cost. Any known shortest-path algorithm may be used for this.
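That shortest-path step can be sketched as a layer-by-layer relaxation over the DAG, using only the w2 weights. The tiny three-layer graph and its weights below are made-up values, not the full v(i,j) construction:

```python
# Minimal sketch: shortest path through a layered DAG on the w2
# (communication) weight only. Layers are processed front to back,
# which is enough because all edges go between consecutive layers.

def shortest_w2(layers, w2):
    """layers : list of node-lists, layer 0 .. layer L
       w2[(u, v)] : weight of edge u -> v between consecutive layers
       (a missing key means no edge)."""
    INF = float("inf")
    dist = {layers[0][0]: 0}           # single node in the top layer
    for a, b in zip(layers, layers[1:]):
        nxt = {v: INF for v in b}
        for u in a:
            for v in b:
                if (u, v) in w2 and dist.get(u, INF) + w2[(u, v)] < nxt[v]:
                    nxt[v] = dist[u] + w2[(u, v)]
        dist = nxt
    return dist[layers[-1][0]]         # single node in the bottom layer

layers = [["s"], ["a", "b"], ["t"]]
w2 = {("s", "a"): 4, ("s", "b"): 1, ("a", "t"): 1, ("b", "t"): 2}
print(shortest_w2(layers, w2))  # 3, via b
```

Because the graph is layered, this dynamic program visits each edge once, which is cheaper than a general-purpose shortest-path routine on the same graph.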

Remark: To send the message e from v(i,j) to v(r,j+1), we must do the following:

1. Migrate process pi from processor j to processor j+1.
2. Send the message from pi to pi+1 by executing pi.
3. Repeat step 2 until the message e reaches pr-1.
4. At process pr-1, find the communication cost between pr-1 and pr.

We conclude that sending message e from v(i,j) to v(r,j+1) requires the total execution time from pi to pr-1, represented by w1(e), plus the communication cost, represented by w2(e). Therefore, to minimize the cost, we should place related processes close to each other.

Q/ Why did we not introduce the total communication cost?

Ans: The cost depends on the message length and the per-word speed of each processor, so we cannot use the total communication cost; see the following figure.


Non-Migratory processor allocation (Static)

LS (List Scheduling) algorithm

• Assign processes A, B, C to P1, P2, P3 respectively.
• Check the next processes connected to process A; these are D and E. Select the leftmost process appearing in the task graph.
• Check the next processes connected to process B; these are D and F. Select F, since D was selected before.
• Check the next processes connected to process C; these are E and F. Select the leftmost process, E.
• We do not care about the communication cost, but we must wait for the execution time to end before starting execution of the next process.


ELS (Extended LS)

The selection of LS is implemented, but we should also insert the communication cost.

ETF (Earliest Task First)

We try to reduce the communication cost by assigning the tasks with the largest communication cost to the same processor.


Example: Consider the following information for six processes working on two parallel processors. Apply ELS and ETF to show the time analysis of each processor, with load balance and minimum time (if possible).


Process Scheduling In Distributed Systems

• The scheduling algorithm in general has two steps:
1. The allocation of processors to processes (process migration) // processor allocation
2. Determining the order in which jobs will be executed on the processors // process allocation

Allocating processes in a parallel/multicomputer system:

Requirements:
• Given a set of tasks with:
  - certain precedence constraints (two operations A and B connected by a precedence constraint A -> B means that operation B can only start when A is finished),
  - computation time (execution time),
  - and communication time.
• Given a set of processors connected by a communication network,
• Find the assignment of tasks to processors, and the order of their execution, that minimizes the total execution time. // (process allocation algorithm policy)

Conditions:
1. If two communicating tasks are allocated to the same node, the communication time between them is zero and the total response time is reduced.
2. If we need to increase parallelism and reduce the total execution time of the computations, we should allocate tasks to different nodes.

The scheduler performs the following loop:
• Select the best job to run, according to policy and available resources.
• Start the job.
• Stop the job and/or update its burst time.
• Repeat.


Assessing Schedulers

Schedulers are assessed by two metrics:
1. The performance of the schedule generated (schedule length).
2. The efficiency of the scheduler: the time taken by the scheduler to generate a schedule.

Scheduling Characteristics
1. We should select a suitable policy (scheduling algorithm) to minimize response time, average completion time per application, and load imbalance among processors.
2. Using non-preemptive or preemptive schedulers:


3. Using non-adaptive or adaptive schedulers:

– Adaptive schedulers may change their behavior based on information received from the system. This change is necessary to increase system performance (reduce schedule length), but the information gathering implies system overhead.

// Because it selects the proper process for the current situation, the AFCFS policy is better than the FCFS policy with respect to system performance, but it needs more complexity.


Example:

Clustering Algorithms
• The process of mapping nodes of a task graph into labeled clusters.
• All nodes in the same cluster are allocated to the same processor.
• Optimal clustering is NP-complete (more than one solution exists, but one of them is better).
• {T1, ..., Tn} ⇒ {C1, ..., Ck} // n > k


Linear clustering: each processor includes one cluster.


Non-linear clustering: each processor includes more than one cluster.

Overcoming the Problem of Process Scheduling in DOS

To overcome the space-sharing problem, we should use backfilling and gang scheduling to schedule related threads/tasks.

Q/ What is the space-sharing problem?

Suppose that A0 and B0 start first in time slice 0. A0 sends A1 a request, but A1 does not get the request until it runs in time slice 1, starting at 100 msec. It sends the reply immediately, but A0 does not get the reply until it runs again at 200 msec. The net result is one request-reply sequence every 200 msec.


Solving the Space-Sharing Problem

Another example of the space-sharing problem: FCFS is the simplest space-sharing scheduling policy, and the following figure shows the problem with this policy.


Backfilling is an optimization that overcomes the above problem by filling holes with jobs taken from the queue in FCFS order. (This can be used if there is no dependency between the blocks.)


Backfilling optimization Scheduling

• If all jobs have arrived, the scheduler allows small jobs from the back of the queue that fit free holes to execute before larger jobs that arrived earlier.
• It identifies holes in the 2D chart and moves smaller jobs to fit and fill those holes.
• Two types of backfilling are used: conservative and aggressive (EASY).

Conservative Backfilling
Every job is given a reservation when it enters the system, and a job is allowed to backfill only if it does not violate any of the previous reservations. Starvation cannot occur at all.

Aggressive EASY (Extensible Argonne Scheduling sYstem)
Only the job at the head of the queue is given a reservation, and a job is allowed to backfill if it does not violate the reservation of that first job. Starvation cannot occur.
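The aggressive (EASY) rule can be sketched as a small simulation: only the head job holds a reservation, and a later job may backfill only if it starts now and finishes by that reservation. The `easy_schedule` helper, the job tuples, and the sample workload are my own illustrative assumptions, not a real scheduler:

```python
# Hedged sketch of EASY backfilling on one machine of `total_procs`
# processors. Jobs are (name, processors, runtime) in FCFS order.

def easy_schedule(jobs, total_procs):
    running = []                      # (end_time, procs) of running jobs
    queue = list(jobs)
    now, free, starts = 0, total_procs, []

    while queue:
        name, p, rt = queue[0]
        if p <= free:                 # head job fits: start it (FCFS kept)
            queue.pop(0)
            running.append((now + rt, p)); free -= p
            starts.append((name, now))
            continue
        # reservation: earliest time enough processors free for the head
        f, t_res = free, None
        for end, q in sorted(running):
            f += q
            if f >= p:
                t_res = end
                break
        # backfill: a later job may start now only if it does not
        # delay the head job's reservation
        for j in queue[1:]:
            n2, p2, rt2 = j
            if p2 <= free and now + rt2 <= t_res:
                queue.remove(j)
                running.append((now + rt2, p2)); free -= p2
                starts.append((n2, now))
                break
        else:
            # nothing can backfill: advance to the next job completion
            now = min(end for end, _ in running)
            for end, q in [r for r in running if r[0] <= now]:
                running.remove((end, q)); free += q
    return starts

jobs = [("A", 4, 10), ("B", 6, 10), ("C", 2, 5), ("D", 2, 20)]
print(easy_schedule(jobs, 8))
# [('A', 0), ('C', 0), ('B', 10), ('D', 10)]
```

Here C backfills at time 0 because it ends (t = 5) before B's reservation (t = 10), while D cannot, since its 20-unit runtime would push past B's start.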


So it is mandatory to consider another approach, based on gang scheduling.

Gang scheduling

The gang scheduling algorithm is introduced through three main parts:

a) Groups of related threads/tasks are scheduled as a unit (a gang).

b) All members of a gang run simultaneously on different timeshared CPUs.

c) All gang members start and end together (they run simultaneously).

To show the implementation of the gang scheduling algorithm, consider five processes (A, B, C, D and E) and a multiprocessor system with six CPUs, as given in the figure. During time slot 0, related threads A0 through A5 are scheduled and run; during time slot 1, related threads B0, B1, B2, C0, C1 and C2 are scheduled and run; during time slot 2, D's five threads and E0 get to run; the remaining six threads belonging to process E run in time slot 3; then the cycle repeats, with slot 4 being the same as slot 0, and so on.
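The slot table of this example can be reproduced with a simple first-fit fill of six CPUs per time slot. This is only a sketch of the packing in the figure (the `gang_slots` helper and thread naming are assumptions); note that it splits E across slots 2 and 3 exactly as in the example, because E's seven threads exceed the machine:

```python
# Hedged sketch: pack gangs into time slots of `cpus` CPUs, first fit.

def gang_slots(gangs, cpus):
    """gangs: list of (process name, thread count); returns the
    per-time-slot thread assignment."""
    slots, cur = [], []
    for name, n in gangs:
        for i in range(n):
            if len(cur) == cpus:      # current time slot is full
                slots.append(cur); cur = []
            cur.append(f"{name}{i}")
    if cur:
        slots.append(cur)
    return slots

gangs = [("A", 6), ("B", 3), ("C", 3), ("D", 5), ("E", 7)]
for t, slot in enumerate(gang_slots(gangs, 6)):
    print("slot", t, slot)
# slot 0: A0..A5; slot 1: B0..B2, C0..C2; slot 2: D0..D4, E0; slot 3: E1..E6
```
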

Figure: Gang scheduling (preemptive/non-preemptive).

Figure: Non-gang scheduling.

• In a distributed real-time system, jobs usually consist of frequently communicating tasks which can be processed in parallel.

• An efficient way to schedule dynamic, parallel jobs is gang scheduling, which is a combination of time sharing and space sharing (as in b and c above).

• According to this technique, a parallel job is decomposed into tasks that are grouped together into a gang, then scheduled and executed simultaneously on different processors.

• The number of tasks in a gang must be less than or equal to the number of available processors. If the number of tasks is greater than the number of processors, then we need more than one slice.

Policies for Scheduling (process allocation)

There are many possible methods to implement the gang scheduling algorithm; the well-known methods are:

a) The Adaptive First Come First Serve (AFCFS) method.

b) The Largest Job First Served (LJFS) method.

c) The Shortest Time First (STF) method.

Performance metrics
In order to evaluate the system's performance, we apply the following performance metrics, summarized in the following table:

Response time r_j of a parallel job j is the time it takes for the job to be processed by the system, from the moment the job arrives at the dispatcher (selected from the top of the queue). The Average Response Time (ART) is defined as

ART = ( Σ (j = 1 to N) r_j ) / N,  where N is the number of jobs.

We also include the Average Weighted Response Time (AWRT), which takes into account the size of each job; this is an important factor, since highly parallel jobs are the prime concern:

AWRT = ( Σ p(x_j) × r_j ) / Σ p(x_j),  where p(x_j) is the number of tasks of job j.

The Average Waiting Time (AWT) and Average Weighted Waiting Time (AWWT) are defined in the same way. The waiting time w_j of a parallel job j is the time between its arrival and the beginning of its service.

Slowdown S_j is the response time of the job divided by its service time:

S_j = r_j / e_j,  where e_j is the actual runtime of job j (execution time using multiple processors): e_j = slot time × number of slots used.

a- Adapted First Come First Served (AFCFS)

This method attempts to schedule a gang whenever the processors assigned to its tasks are available.

• Select the job at the front of the queue.
• When there are not enough processors available for a large job whose tasks are waiting at the front of the queue, the AFCFS policy schedules smaller jobs whose tasks are behind the tasks of the large job.
• When we have enough processors to execute all tasks of the job at the front of the queue, the extra processors should serve smaller jobs whose tasks are behind the front job and fit these processors.


Algorithm of AFCFS:

Step 1: // Initializations
  Let NSi be the number of slots for the ith job.
  Let NTi be the number of tasks of the ith job.
  Let N be the number of processors.
  Let Si be the slot time of the ith job.
  Let RP be the remaining processors in the slot.
  Let n be the number of jobs.
  Let incr be the time increment to the next slot.
Step 2: For each job i = 1, …, n in the queue:
  Step 2-1: Assign job i to the slot.
  Step 2-2: Remove job i from the queue.
  Step 2-3: If NTi > N then { NSi = ⌈NTi / N⌉; RP = N − (NTi % N); }
            else { NSi = 1; RP = N − NTi; }
  Step 2-4: Set incr = Si × NSi.
  Step 2-5: For every job j in the queue such that NTj ≤ RP:
    Step 2-5-1: Assign job j to the slot.
    Step 2-5-2: Remove job j from the queue.
    Step 2-5-3: If Si < Sj then { incr = incr + (Sj − Si); RP = RP − NTj; }
    Step 2-5-4: If RP > 0, go to Step 2-5.
  Step 2-6: Increment time by incr.
Step 3: End.
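The steps above can be sketched as follows. This is a hedged reading of the algorithm (the `max(...)` form of Step 2-5-3 and the tuple layout are my assumptions), checked against the worked example that follows:

```python
# Hedged AFCFS sketch: jobs taken FCFS; smaller jobs behind the front
# job are packed into the leftover processors of its slot.
from math import ceil

def afcfs(jobs, N):
    """jobs: list of (name, tasks NT, slot time S) in queue order;
    N processors. Returns each job's completion time."""
    queue, time, done = list(jobs), 0, {}
    while queue:
        name, nt, s = queue.pop(0)       # Steps 2-1, 2-2: front job
        ns = ceil(nt / N)                # Step 2-3: slots needed
        rp = ns * N - nt                 # remaining processors
        incr = s * ns                    # Step 2-4
        batch = [name]
        for j in list(queue):            # Step 2-5: pack smaller jobs
            n2, t2, s2 = j
            if t2 <= rp:
                queue.remove(j)
                rp -= t2
                incr = max(incr, s * (ns - 1) + s2)   # Step 2-5-3
                batch.append(n2)
        time += incr                     # Step 2-6
        for n2 in batch:
            done[n2] = time
    return done

jobs = [("J5", 8, 10), ("J2", 7, 10), ("J1", 5, 5),
        ("J6", 8, 5), ("J4", 2, 8), ("J3", 10, 10)]
print(afcfs(jobs, 8))
# {'J5': 10, 'J2': 20, 'J1': 28, 'J4': 28, 'J6': 33, 'J3': 53}
```

Note how J4 (2 tasks) is packed beside J1, stretching that slot from 5 to 8 time units, and how J3's 10 tasks need ⌈10/8⌉ = 2 slots.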


job                           J5  J2  J1  J6  J4  J3
Number of tasks (gang size)    8   7   5   8   2  10
Slot time                     10  10   5   5   8  10

The above model is implemented on the pool model. // why?

A slot S_i contains the following:

S_i = {T_i1, T_i2, …, T_im}

such that:
- i is the job's index,
- m is the number of CPUs,
- T_ij is the jth task from job i.


N – number of jobs. Response time of job j:

r_j = RT of one task × (#tasks / #CPUs used), where the RT of one task is the service end time of its slot minus the arrival time (all jobs arrive at time 0):

r5 = (10 − 0) × 8/8 = 10
r2 = (20 − 0) × 7/7 = 20
r1 = (28 − 0) × 5/5 = 28
r4 = (28 − 0) × 2/2 = 28
r6 = (33 − 0) × 8/8 = 33
r3 = (53 − 0) × 10/10 = 53

ART = ( Σ (j = 1 to 6) r_j ) / 6 = 172/6 ≈ 28.66

AWRT = (8×10 + 7×20 + 5×28 + 2×28 + 8×33 + 10×53) / 40 = 1210/40 = 30.25,  where p(x_j) = #tasks and Σ p(x_j) = 40.

Throughput = #processes / time period = 6/53 ≈ 0.113


w_j = D_j − A_j − S_j, where D is the departure time, A the arrival time, and S_j the slot (service) time occupied by the job. (For J1 and J4 this is the shared 8-unit slot; for J3 it is 10 × 2 slots = 20.)

w1 = 28 − 8 = 20
w2 = 20 − 10 = 10
w3 = 53 − 20 = 33
w4 = 28 − 8 = 20
w5 = 10 − 10 = 0
w6 = 33 − 5 = 28

AWT = 111/6 = 18.5
AWWT = (5×20 + 7×10 + 10×33 + 2×20 + 8×0 + 8×28) / 40 = 764/40 = 19.1

Slowdown S_j is the response time of the job divided by its service time:

S5 = 10/10 = 1
S2 = 20/10 = 2
S1 = 28/8 = 3.5
S4 = 28/8 = 3.5
S6 = 33/5 = 6.6
S3 = 53/20 = 2.65

WSLD = (5×3.5 + 7×2 + 10×2.65 + 2×3.5 + 8×1 + 8×6.6) / 40 = 125.8/40 = 3.145
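The metric definitions can be checked in a few lines. Each job below is encoded as (tasks p_j, completion r_j, occupied service time e_j), with all arrivals at t = 0, using the AFCFS numbers above; the `metrics` helper and the tuple encoding are my own sketch, not part of the original notes:

```python
# Hedged sketch of ART, AWRT, AWT, AWWT and WSLD from the definitions
# above, evaluated on the AFCFS example (J5, J2, J1, J6, J4, J3).

def metrics(jobs):
    """jobs: list of (tasks p, completion r, occupied service e)."""
    P = sum(p for p, _, _ in jobs)                      # total tasks
    art  = sum(r for _, r, _ in jobs) / len(jobs)       # plain mean
    awrt = sum(p * r for p, r, _ in jobs) / P           # task-weighted
    awt  = sum(r - e for _, r, e in jobs) / len(jobs)   # wait = r - e
    awwt = sum(p * (r - e) for p, r, e in jobs) / P
    wsld = sum(p * r / e for p, r, e in jobs) / P       # slowdown r/e
    return art, awrt, awt, awwt, wsld

afcfs_jobs = [(8, 10, 10), (7, 20, 10), (5, 28, 8),
              (8, 33, 5), (2, 28, 8), (10, 53, 20)]
for name, v in zip(("ART", "AWRT", "AWT", "AWWT", "WSLD"),
                   metrics(afcfs_jobs)):
    print(name, round(v, 3))
# ART 28.667, AWRT 30.25, AWT 18.5, AWWT 19.1, WSLD 3.145
```

The same helper applies unchanged to the LJFS and STF examples by swapping in their (p, r, e) triples.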

b- Largest Job First Served (LJFS)

With this policy, tasks are placed in processor queues in decreasing gang-size order (tasks that belong to larger gangs are placed at the head of the queues). All tasks in the queues are searched in order, and the first jobs whose assigned processors are available begin execution.

• Sort jobs according to the largest number of tasks.
• If the number of tasks is equal for two or more jobs, select the job with the larger slot time.


job                           J5  J2  J1  J6  J4  J3
Number of tasks (gang size)    8   7   5   8   2  10
Slot time                     10  10   5   5   8  10

Response time of job j, r_j = RT of one task × (#tasks / #CPUs used):

r3 = (20 − 0) × 10/10 = 20
r5 = (30 − 0) × 8/8 = 30
r6 = (35 − 0) × 8/8 = 35
r2 = (45 − 0) × 7/7 = 45
r1 = (53 − 0) × 5/5 = 53
r4 = (53 − 0) × 2/2 = 53

ART = ( Σ (j = 1 to 6) r_j ) / 6 = 236/6 ≈ 39.33

AWRT = (10×20 + 8×30 + 8×35 + 7×45 + 5×53 + 2×53) / 40 = 1406/40 = 35.15


Throughput = #jobs / time period = 6/53 ≈ 0.113

w_j = D_j − A_j − S_j, where D is the departure time, A the arrival time, and S_j the occupied slot (service) time:

w1 = 53 − 8 = 45
w2 = 45 − 10 = 35
w3 = 20 − 20 = 0
w4 = 53 − 8 = 45
w5 = 30 − 10 = 20
w6 = 35 − 5 = 30

AWT = 175/6 ≈ 29.166
AWWT = (5×45 + 7×35 + 10×0 + 2×45 + 8×20 + 8×30) / 40 = 960/40 = 24

Slowdown S_j is the response time of the job divided by its service time:

S5 = 30/10 = 3
S2 = 45/10 = 4.5
S1 = 53/8 = 6.625
S4 = 53/8 = 6.625
S6 = 35/5 = 7
S3 = 20/20 = 1

WSLD = (5×6.625 + 7×4.5 + 10×1 + 2×6.625 + 8×3 + 8×7) / 40 = 167.875/40 ≈ 4.2

c- Shortest Time First (STF) method.


This policy places the shortest jobs at the top of the processor queue. A job whose assigned processors are available is executed; the job on top is considered first, then the following jobs in the work queue.

• Sort jobs according to the smallest number of tasks.
• If the number of tasks is equal for two or more jobs, select the job with the smaller slot time.

job                           J5  J2  J1  J6  J4  J3
Number of tasks (gang size)    8   7   5   8   2  10
Slot time                     10  10   5   5   8  10


Response time of job j, r_j = RT of one task × (#tasks / #CPUs used):

r1 = (8 − 0) × 5/5 = 8
r4 = (8 − 0) × 2/2 = 8
r2 = (18 − 0) × 7/7 = 18
r5 = (28 − 0) × 8/8 = 28
r6 = (33 − 0) × 8/8 = 33
r3 = (53 − 0) × 10/10 = 53

ART = ( Σ (j = 1 to 6) r_j ) / 6 = 148/6 ≈ 24.66

AWRT = (5×8 + 2×8 + 7×18 + 8×28 + 8×33 + 10×53) / 40 = 1200/40 = 30

Throughput = #jobs / time period = 6/53 ≈ 0.113


w_j = D_j − A_j − S_j, where D is the departure time, A the arrival time, and S_j the occupied slot (service) time:

w1 = 8 − 8 = 0
w2 = 18 − 10 = 8
w3 = 53 − 20 = 33
w4 = 8 − 8 = 0
w5 = 28 − 10 = 18
w6 = 33 − 5 = 28

AWT = 87/6 = 14.5
AWWT = (5×0 + 7×8 + 10×33 + 2×0 + 8×18 + 8×28) / 40 = 754/40 = 18.85

Slowdown S_j is the response time of the job divided by its service time:

S5 = 28/10 = 2.8
S2 = 18/10 = 1.8
S1 = 8/8 = 1
S4 = 8/8 = 1
S6 = 33/5 = 6.6
S3 = 53/20 = 2.65

WSLD = (5×1 + 7×1.8 + 10×2.65 + 2×1 + 8×2.8 + 8×6.6) / 40 = 121.3/40 ≈ 3.03

[Bar charts comparing the three policies (STF, LJF, AFCFS) on each metric: AWT, AWWT, WSLD, ART and AWRT.]


Homework

Q/ Find the system performance for each algorithm.
Q/ Discuss the above results.
Q/ Show the output of priority scheduling according to the following timetable:

job                           J5  J2  J1  J6  J4  J3
Number of tasks (gang size)    8   7   5   8   2  10
Slot time                     10  10   5   5   8  10
Priority                       2   1   4   6   5   3

The highest priority is 1 and the lowest is 6, implemented on a distributed system with 8 processors.