![Page 1: Scheduling and Resource Management for Next- generation Clusters Yanyong Zhang Penn State University yyzhang](https://reader030.vdocuments.site/reader030/viewer/2022032607/56649ed55503460f94be5fca/html5/thumbnails/1.jpg)
Scheduling and Resource Management for Next-generation Clusters
Yanyong Zhang
Penn State University
www.cse.psu.edu/~yyzhang
What is a Cluster?
• Cost effective
• Easily scalable
• Highly available
• Readily upgradeable
Scientific & Engineering Applications
• HPTi wins a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm)
• Sandia's expansion of their Alpha-based C-plant system.
• Maui HPCC LosLobos Linux Super-cluster (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)
• A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 …. (http://www.swiss.ai.mit.edu/~pas/p/sc95.html)
• The PC cluster based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide …. (http://www.osc.edu/press/releases/2001/approved.shtml)
Commercial Applications
• Business applications
– Transaction Processing (IBM DB2, Oracle …)
– Decision Support System (IBM DB2, Oracle …)
• Internet applications
– Web serving / searching (Google.com …)
– Infowares (yahoo.com, AOL.com)
– Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
– Computing portal
Resource Management
• Each application is demanding
• Several applications/users can be present at the same time
Resource management and Quality-of-service become important.
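System Model
• Each node is independent
• Maximum MPL (multiprogramming level)
• Arrival queue
[Figure: jobs wait in an arrival queue (Arrival Q) in front of nodes P0–P4, which are connected by a high-speed network]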
Two Phases in Resource Management
• Allocation Issues
– Admission Control
– Arrival Queue Principle
• Scheduling Issues (CPU Scheduling)
– Resource Isolation
– Co-allocation
Co-allocation / Co-scheduling
[Figure: time line from t0 to t1 for nodes P0 and P1 connected by a switch, showing a SEND, the matching RECV, and the resulting scheduling skewness]
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers (NEXT)
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications
• From application’s perspective
– Contribution 4: optimizing clustered DB2
Contribution 1: Boosting CPU Utilization at Supercomputing Centers
Objective
[Figure: Response Time = Wait Time + Execute Time; Wait Time is the wait in the arrival Q, and Execute Time includes the wait in the ready/blocked Q]
Minimize $\text{slowdown} = \dfrac{\text{Response Time}}{\text{Execute Time in Isolation}}$
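The slowdown metric on this slide can be illustrated with a few lines of Python. This is only a sketch: the function name `slowdown`, its parameter names, and the job timings are hypothetical and do not come from the talk.

```python
def slowdown(wait_time, execute_time, isolated_execute_time):
    """Slowdown = response time / execution time in isolation.

    wait_time:             time spent in the arrival queue
    execute_time:          time from first dispatch to completion on the
                           shared cluster (includes ready/blocked queue waits)
    isolated_execute_time: run time of the same job on a dedicated machine
    """
    if isolated_execute_time <= 0:
        raise ValueError("isolated_execute_time must be positive")
    response_time = wait_time + execute_time
    return response_time / isolated_execute_time


# Hypothetical job: waits 30 s, runs for 90 s when sharing the cluster,
# and would need 60 s on a dedicated machine.
print(slowdown(wait_time=30.0, execute_time=90.0, isolated_execute_time=60.0))  # 2.0
```

A slowdown of 1 would mean the job ran as if it had the machine to itself; larger values mean more time lost to queueing and sharing.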
Existing Techniques
• Back Filling (BF)
• Gang Scheduling (GS)
• Migration (M)
[Figure: space-time diagram of job placement on a machine with # of CPUs = 14]
Proposed Scheme
• MBGS = GS + BF + M
– Use GS as the basic framework
– At each row of the GS matrix, apply the BF technique
– Whenever the GS matrix is re-calculated, M should be considered.
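To make the "GS matrix" concrete, here is a minimal sketch of the basic framework only: rows are time slices, columns are CPUs, and a job is placed first-fit into a row with enough free CPUs. It does not implement backfilling or migration, and `new_matrix`, `place_job` and `max_mpl` are illustrative names, not the names used in the MBGS work.

```python
def new_matrix(num_cpus):
    """Create a gang-scheduling matrix with one empty row (time slice)."""
    return [[None] * num_cpus]


def place_job(matrix, job_id, cpus_needed, max_mpl):
    """First-fit placement of a job into the matrix.

    Every task of a job sits in the same row, so all of them run in the
    same time slice. Returns True if the job was placed, False if it has
    to stay in the arrival queue.
    """
    num_cpus = len(matrix[0])
    if cpus_needed > num_cpus:
        raise ValueError("job needs more CPUs than the machine has")
    # Try to fit the job into an existing time slice.
    for row in matrix:
        free = [cpu for cpu, owner in enumerate(row) if owner is None]
        if len(free) >= cpus_needed:
            for cpu in free[:cpus_needed]:
                row[cpu] = job_id
            return True
    # Otherwise open a new time slice, up to the maximum MPL.
    if len(matrix) < max_mpl:
        row = [None] * num_cpus
        for cpu in range(cpus_needed):
            row[cpu] = job_id
        matrix.append(row)
        return True
    return False


matrix = new_matrix(num_cpus=4)
for job, size in [("A", 3), ("B", 2), ("C", 1), ("D", 3)]:
    placed = place_job(matrix, job, size, max_mpl=2)
    print(job, "placed" if placed else "queued")
```

In this toy run job D stays queued because neither time slice has three free CPUs; filling such holes with smaller queued jobs is where a backfilling pass over each row would come in.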
How Does MBGS Perform?
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads (NEXT)
– Contribution 3: scheduling multiple classes of applications
• From application’s perspective
– Contribution 4: optimizing clustered DB2
Contribution 2: Reducing Response Times for Commercial Applications
Objective
[Figure: Response Time = Wait Time + Execute Time; Wait Time is the wait in the arrival Q, and Execute Time includes the wait in the ready/blocked Q]
• Minimize wait time
• Minimize response time
Previous Work I: Gang Scheduling (GS)
GS is not responsive enough!
[Figure: gang-scheduling example with callouts (1) and (2), annotated "MINUTES!" and "wasted"]
Previous Work II: Dynamic Co-scheduling
[Figure: nodes P0, P1, P2 and P3 running B, D, A and C, with the captions "B just gets a msg", "Everybody else is blocked", "It’s A’s turn" and "C just finishes I/O"]
The scheduler on each node makes independent decisions based on local events, without global synchronization.
Dynamic Co-scheduling Heuristics
Columns: How do you wait for a message? Rows: What do you do on message arrival?

| | Busy Wait | Spin Block | Spin Yield |
|---|---|---|---|
| No Explicit Reschedule | Local | SB | SY |
| Interrupt & Reschedule | DCS | DCS-SB | DCS-SY |
| Periodically Reschedule | PB | PB-SB | PB-SY |
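The Spin Block (SB) way of waiting for a message can be sketched in user-level Python: spin for a bounded time in the hope that the sender is currently scheduled, then block and give up the CPU. This is an illustration only; a real implementation lives in the messaging layer or kernel, and `spin_block_wait`, `message_arrived` and `spin_time` are assumed names.

```python
import threading
import time


def spin_block_wait(message_arrived, spin_time):
    """Wait for a message using the spin-block heuristic.

    message_arrived: threading.Event set by the receive path
    spin_time:       maximum time (seconds) to busy-wait before blocking

    Returns "spin" if the message showed up while spinning, or "block"
    if the caller had to block (yielding the CPU) until it arrived.
    """
    deadline = time.monotonic() + spin_time
    while time.monotonic() < deadline:
        if message_arrived.is_set():
            return "spin"
    # Spin budget exhausted: block until the message arrives.
    message_arrived.wait()
    return "block"


if __name__ == "__main__":
    arrived = threading.Event()
    # Simulate a message that arrives 50 ms from now.
    threading.Timer(0.05, arrived.set).start()
    print(spin_block_wait(arrived, spin_time=0.001))
```

The spin time is the tuning knob: too short and the process blocks just before its message arrives, too long and CPU cycles are wasted that another job could have used.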
Simulation Study
• A detailed simulator at a microsecond granularity
• System parameters
– System configurations (maximum MPL, to partition or not)
– System overheads (context switch overheads, interrupt costs, costs associated with manipulating queues)
Simulation Study (Cont’d)
• Application parameters
– Injection load
– Characteristics (CPU intensive, I/O intensive, communication intensive or somewhere in the middle)
Impact of Load
Impact of Workload Characteristics
[Figure panels: Comm intensive; I/O intensive]
Periodic Boost Heuristics
• S1: Compute Phase
• S2: S1 + Unconsumed Msg.
• S3: Recv. + Msg. Arrived
• S4: Recv. + No Msg.

• A: S3 -> {S2, S1}
• B: S3 -> S2 -> S1
• C: {S3, S2, S1}
• D: {S3, S2} -> S1
• E: S2 -> S3 -> S1
[Figure: Average Job Response Time (×10000 seconds), y-axis from 2.3 to 2.9, for heuristics A, B, C, D and E]
Analytical Modeling Study
• The state space is impossible to handle.
[Figure: nodes P0, P1, P2, P3, …, Pp connected by a high-speed network, with dynamic job arrivals]
Analysis Description
Original State Space (impossible to handle!!)
[Equation: state vector X covering all nodes k = 1, …, P (P = number of nodes); for each node it records the number of jobs on node k and per-job components indexed by i, jA, jB, jQ and jR]
Assumption: The state of each processor is stochastically independent of, and identical to, the state of the other processors.
Reduced State Space (much more tractable!!)
[Equation: state vector Y for a single node, with components indexed by i, jA, jB, jQ and jR]
Analysis Description (Cont)
• Address the state transition rates using a continuous-time Markov model; build the generator matrix $Q$.
• Get the invariant probability vector $\pi$ by solving $\pi Q = 0$ and $\pi e = 1$, where $e$ is the column vector of all ones.
• Use fixed-point iteration to get the solution.
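As a small illustration of solving $\pi Q = 0$, $\pi e = 1$, the sketch below computes the invariant vector of a toy generator matrix by uniformization followed by repeated multiplication. This is a generic textbook method chosen for brevity; it is not the fixed-point scheme of the talk, and the two-state matrix is made up.

```python
def stationary_distribution(Q, tol=1e-12, max_iter=100000):
    """Approximate pi with pi Q = 0 and sum(pi) = 1 for a CTMC generator Q.

    Q is a list of rows; off-diagonal entries are transition rates and
    each row sums to zero. The chain is uniformized into a discrete-time
    chain P = I + Q / rate, and pi <- pi P is iterated to convergence.
    """
    n = len(Q)
    max_exit = max(-Q[i][i] for i in range(n))
    if max_exit <= 0:
        raise ValueError("generator has no transitions")
    rate = 1.1 * max_exit  # strictly above the largest exit rate
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("iteration did not converge")


# Toy two-state chain: leaves state 0 at rate 2 and state 1 at rate 1.
Q = [[-2.0, 2.0],
     [1.0, -1.0]]
print(stationary_distribution(Q))  # close to [1/3, 2/3]
```

For this chain the exact answer is $\pi = (1/3, 2/3)$, which can be checked by hand against $\pi Q = 0$.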
SB Example
[Figure: state-transition diagram for SB with two jobs on a node; each state gives the status of job 1 and job 2 (C, IO, SN, SP or B), and transitions are labeled with rates such as r1 and r1’ and with probabilities P1 and 1 − P1]
[Equations: r1 is written as a probability-weighted sum over states; r2 = …]
Results
[Figure panels: Optimal PB Frequency; Optimal Spin Time for SB]
Results – Optimal Quantum Length
[Figure panels: Comm Intensive; CPU Intensive; I/O Intensive]
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications (NEXT)
• From application’s perspective
– Contribution 4: optimizing clustered DB2
Contribution 3: Scheduling Multiple Classes of Applications
realtime
interactive
batch
Objective
[Figure: best-effort (BE) and real-time (RT) applications sharing a cluster]
• BE: How long did it take me to finish? (Response time)
• RT: How many deadlines have been missed? (Miss rate)
Fairness Ratio (x:y)
• RT share of the cluster resource: $\dfrac{x}{x+y}$
• BE share of the cluster resource: $\dfrac{y}{x+y}$
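The two shares follow directly from the ratio. A minimal sketch, with `resource_shares` as an illustrative name and the ratios taken from the later result slides:

```python
def resource_shares(x, y):
    """Split the cluster resource between RT and BE for a fairness ratio x:y.

    Returns (rt_share, be_share) as fractions that sum to 1.
    """
    if x < 0 or y < 0 or x + y == 0:
        raise ValueError("x and y must be non-negative and not both zero")
    total = x + y
    return x / total, y / total


for x, y in [(2, 1), (1, 9), (9, 1)]:
    rt, be = resource_shares(x, y)
    print(f"RT:BE = {x}:{y} -> RT {rt:.3f}, BE {be:.3f}")
```

With x:y = 2:1, for example, RT applications are entitled to two thirds of the cluster and BE applications to one third.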
How to Adhere to Fairness Ratio?
x:y = 2:1
[Figure: time-versus-processor (P0, P1) schedules of RT and BE jobs under three schemes: 1GS, 2DCS-TDM and 2DCS-PS]
BE response time
[Figure panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1]
RT Deadline Miss Rate
[Figure panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1]
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications
• From application’s perspective
– Characterizing decision support workloads on the clustered database server (NEXT)
– Resource management for transaction processing workloads on the clustered database server
Experiment Setup
• IBM DB2 Universal Database for Linux, EEE, Version 7.2
• 8 dual node Linux/Pentium cluster with 256 MB of RAM and an 18 GB disk on each node.
• TPC-H workload. Queries are run sequentially (Q1 – Q20). Completion time for each query is measured.
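Platform
[Figure: a client and the server nodes connected by Myrinet. The client issues "Select * from T" to the coordinator node; the rows of Table T (001A, 002B, 003C, 004D) are spread across the server nodes, and numbered steps 1–5 mark the flow of the query and its results]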
Methodology
• Identify the components with high system overhead.
• For each such component, characterize the request distribution.
• Come up with ways of optimization.
• Quantify potential benefits from the optimization.
Sampling OS Statistics
• Sample the statistics provided by stat, net/dev, process/stat.
– User/system CPU %
– # of page faults
– # of blocks read/written
– # of reads/writes
– # of packets sent/received
– CPU utilization during I/O
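Sampling of this kind can be sketched as follows, assuming the `stat` file mentioned above is Linux's `/proc/stat`. The sketch derives user and system CPU percentages from two samples of the aggregate `cpu` line; the helper names and the sample values are hypothetical.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into jiffy counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle [iowait irq softirq steal ...].
    # Only the first eight are summed; guest time is already part of user.
    return {
        "user": values[0],
        "system": values[2],
        "total": sum(values[:8]),
    }


def cpu_percentages(before, after):
    """User/system CPU % over the interval between two samples."""
    total = after["total"] - before["total"]
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {key: 100.0 * (after[key] - before[key]) / total
            for key in ("user", "system")}


def read_cpu_sample(path="/proc/stat"):
    """Take one sample on a Linux machine (not called in this example)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


# Two made-up samples taken some time apart.
first = parse_cpu_line("cpu  100 0 50 850")
second = parse_cpu_line("cpu  160 0 70 970")
print(cpu_percentages(first, second))  # {'user': 30.0, 'system': 10.0}
```

The other counters on the slide (page faults, blocks, packets) can be sampled the same way: read the file twice and report the difference.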
Kernel Instrumentation
• Instrument each system call in the kernel.
[Figure: instrumentation points along a system call: enter system call -> block -> unblock -> resume execution -> exit system call]
Operating System Profile
• A considerable part of the execution time is taken by the pread system call.
• There is good overlap of computation with I/O for some queries.
• More reads than writes.
TPC-H pread Overhead

| Query | % of exe time | Query | % of exe time |
|---|---|---|---|
| Q6 | 20.0 | Q13 | 10.0 |
| Q14 | 19.0 | Q3 | 9.6 |
| Q19 | 16.9 | Q4 | 9.1 |
| Q12 | 15.4 | Q18 | 9.0 |
| Q15 | 13.4 | Q20 | 7.9 |
| Q7 | 12.1 | Q2 | 5.2 |
| Q17 | 10.8 | Q9 | 5.2 |
| Q8 | 10.5 | Q5 | 4.6 |
| Q10 | 10.3 | Q16 | 4.1 |
| Q1 | 10.0 | Q11 | 3.5 |

pread overhead = # of preads × overhead per pread.
pread Optimization
[Figure: user space, page cache and page table, with copy steps 1 and 2]

    pread(dest, chunk) {
        for each page in the chunk {
            if the page is not in cache {
                bring it in from disk
            }
            copy the page into dest
        }
    }

Optimization:
• Re-mapping the buffer
• Copy on write
30s
Copy-on-write
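[Figure: user space mapped onto the page cache, marked read only]

| Query | % reduction | Query | % reduction |
|---|---|---|---|
| Q1 | 98.9 | Q11 | 96.1 |
| Q2 | 85.7 | Q12 | 87.1 |
| Q3 | 96.0 | Q13 | 100.0 |
| Q4 | 80.9 | Q14 | 96.1 |
| Q5 | 100.0 | Q15 | 96.8 |
| Q6 | 100.0 | Q16 | 70.7 |
| Q7 | 79.7 | Q17 | 94.5 |
| Q8 | 79.3 | Q18 | 100.0 |
| Q9 | 88.7 | Q19 | 95.7 |
| Q10 | 77.8 | Q20 | 94.4 |

$\%\ \text{reduction} = 1 - \dfrac{\#\ \text{of copy-on-write}}{\#\ \text{of preads}}$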
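Copy-on-write semantics can be observed from user space with a private file mapping. The sketch below uses Python's `mmap` with `ACCESS_COPY` as an analogy only: it is not the in-kernel pread remapping evaluated here, and the temporary file and its contents are made up.

```python
import mmap
import os
import tempfile


def cow_demo():
    """Map a file copy-on-write, modify the mapping, and compare with disk.

    Returns (bytes seen before the write, bytes seen after the write,
    bytes still stored in the file).
    """
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"page cache data")
        path = f.name
    try:
        with open(path, "rb") as f:
            # ACCESS_COPY: reads share the cached pages; the first write to
            # a page gives this mapping a private copy of that page.
            view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
            before = bytes(view[:4])
            view[:4] = b"PAGE"  # triggers the copy, file is not changed
            after = bytes(view[:4])
            view.close()
        with open(path, "rb") as f:
            on_disk = f.read(4)
    finally:
        os.unlink(path)
    return before, after, on_disk


print(cow_demo())  # (b'page', b'PAGE', b'page')
```

As long as the buffer is only read, no data is copied; a copy happens only for pages that are actually written, which is why the reduction depends on the ratio of copy-on-write events to preads.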
Operating System Profile
• Socket calls are the next dominant system calls.
Message Characteristics
[Figure: plots for Q11 and Q16 of Message Size (bytes), Message Inter-injection Time (milliseconds) and Message Destination]
Observations on Messages
• Only a small set of message sizes is used.
• Many messages are sent in a short period.
• Message destination distribution is uniform.
• Many messages are point-to-point implementations of multicast/broadcast messages.
• Multicast can reduce # of messages.
Potential % Reduction in Messages
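| Query | Total | Small | Large | Query | Total | Small | Large |
|---|---|---|---|---|---|---|---|
| Q1 | 44.7 | 71.4 | 38.7 | Q11 | 9.6 | 28.6 | 0.1 |
| Q2 | 20.4 | 58.7 | 0.2 | Q12 | 8.3 | 7.8 | 2.9 |
| Q3 | 48.2 | 64.3 | 38.0 | Q13 | 24.5 | 75.2 | 0.1 |
| Q4 | 22.6 | 58.6 | 0.1 | Q14 | 27.9 | 80.4 | 0.7 |
| Q5 | 8.0 | 7.1 | 8.4 | Q15 | 46.6 | 56.5 | 0.7 |
| Q6 | 76.4 | 78.6 | 45.5 | Q16 | 59.1 | 63.0 | 56.9 |
| Q7 | 57.5 | 71.4 | 56.2 | Q17 | 41.5 | 66.7 | 27.3 |
| Q8 | 29.1 | 75.5 | 4.8 | Q18 | 11.4 | 32.3 | 0.0 |
| Q9 | 66.8 | 78.5 | 61.1 | Q19 | 26.7 | 79.4 | 0.2 |
| Q10 | 25.0 | 73.6 | 0.1 | Q20 | 21.1 | 62.8 | 0.1 |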
Online Algorithm

Original send:

    Send ( msg, dest ) {
        send msg to node dest;
    }

Send with multicast buffering:

    Send ( msg, dest ) {
        if (msg == buffered_msg && dest ∉ dest_set)
            dest_set = dest_set ∪ { dest };
        else
            buffer the msg;
    }

    Send_bg () {
        foreach buffered_msg
            if ( it has been buffered longer than threshold )
                send multicast msg to nodes in dest_set;
    }
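The buffering variant of the online algorithm can be sketched as a small runnable class. The class name `MulticastBatcher`, the `multicast` callback and the explicit `now` timestamps are assumptions made so the example is deterministic; they are not part of the original pseudocode.

```python
class MulticastBatcher:
    """Merge identical point-to-point sends into one multicast."""

    def __init__(self, threshold, multicast):
        self.threshold = threshold  # how long a message may stay buffered
        self.multicast = multicast  # callback(msg, list_of_destinations)
        self.pending = []           # buffered entries, oldest first

    def send(self, msg, dest, now):
        """Buffer msg; merge it into an identical buffered message if possible."""
        for entry in self.pending:
            if entry["msg"] == msg and dest not in entry["dests"]:
                entry["dests"].add(dest)
                return
        self.pending.append({"msg": msg, "since": now, "dests": {dest}})

    def send_bg(self, now):
        """Background pass: flush entries buffered longer than the threshold."""
        still_waiting = []
        for entry in self.pending:
            if now - entry["since"] > self.threshold:
                self.multicast(entry["msg"], sorted(entry["dests"]))
            else:
                still_waiting.append(entry)
        self.pending = still_waiting


sent = []
batcher = MulticastBatcher(threshold=5,
                           multicast=lambda msg, dests: sent.append((msg, dests)))
batcher.send("scan", 1, now=0)
batcher.send("scan", 2, now=1)
batcher.send("scan", 3, now=2)
batcher.send_bg(now=6)
print(sent)  # [('scan', [1, 2, 3])]
```

Three point-to-point sends leave the node as one multicast, at the price of up to `threshold` time units of extra latency per message.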
Impact of Threshold
[Figure panels: Q7; Q16. x-axis: Threshold (milliseconds)]
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications
• From application’s perspective
– Characterizing decision support workloads on the clustered database server
– Resource management for clustered database applications (NEXT)
Ongoing/Near-term Work
• What is the optimal number of jobs which should be admitted?
• Can we dynamically pause some processes based on resource requirement and resource availability?
• Which dynamic co-scheduling scheme works best here?
• How do we exploit application level information in scheduling?
Future Work
• Some next-generation applications
– Real time medical imaging and collaborative surgery
Application requirements:
• VAST processing power, disk capacity and network bandwidth
• absolute availability
• deterministic performance
Future Work
– E-business on demand
Requirements:
• performance
– more users
– responsive
– Quality-of-service
• availability
• security
• power consumption
• pricing model
Future Work
• What does it take to get there?
– Hardware innovations
– Resource management and isolation
– Good scalability
– High availability
– Deterministic performance
Future Work
• Not only high performance
– Energy consumption
– Security
– Pricing for service
– User satisfaction
– System management
– Ease of use
Related Work
• Parallel job scheduling:
– Gang Scheduling ([Ousterhout82])
– Backfilling ([Lifka95], [Feitelson98])
– Migration ([Epima96])
• Dynamic co-scheduling:
– Spin Block ([Arpaci-Dusseau98], [Anglano00])
– Periodic Boost ([Nagar99])
– Demand-based Coscheduling ([Sobalvarro97])
Related Work (Cont’d)
• Real-time Scheduling:
– Earliest Deadline First
– Rate Monotonic
– Least Laxity First
• Single node Multi-class scheduling
– Hierarchical scheduling ([Goyal96])
– Proportional share ([Waldspurger95])
• Commercial clustered server (Pai[98], reserve)
Related Work (Cont’d)
• Commercial Workloads (CAECW, [Barford99], Kant[99])
• Database Characterizing ([Keeton99], [Ailamaki99], [Rosenblum97])
• OS support for database ([Stonebraker81], [Gray78], [Christmann87])
• Reducing copies in IO ([Pai00], [Druschel93], [Thadani95])
Publications
• IEEE Transactions on Parallel and Distributed Systems
• International Parallel and Distributed Processing Symposium (IPDPS 2000)
• ACM International Conference on Supercomputing (ICS 2000)
• International Euro-Par Conference (Euro-Par 2000)
• ACM Symposium on Parallel Algorithms and Architectures (SPAA 2001)
• Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2001)
• Workshop on Computer Architecture Evaluation Using Commercial Workloads (CAECW 2002)
Publications I: Batch Applications
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration, 7th Workshop on Job Scheduling Strategies for Parallel Processing.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. The Impact of Migration on Parallel Job Scheduling for Distributed Systems. Proceedings of 6th International Euro-Par Conference Lecture Notes in Computer Science 1900, pages 242-251, Munich, Aug/Sep 2000.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving Parallel Job Scheduling by combining Gang Scheduling and Backfilling Techniques. International Parallel and Distributed Processing Symposium (IPDPS'2000), pages 133-142, May 2000.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. A Comparative Analysis of Space- and Time-Sharing Techniques for Parallel Job Scheduling in Large Scale Parallel Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.
Publications II: Interactive Applications
• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Penn State CSE tech report CSE-01-004.
• Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. To appear in IEEE Transactions on Parallel and Distributed Systems.
• Y. Zhang, A. Sivasubramaniam, H. Franke, J. Moreira. A Simulation-based Performance Study of Cluster Scheduling Mechanisms. 14th ACM International Conference on Supercomputing (ICS'2000), pages 100-109, May 2000.
• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Submitted to ACM Transactions on Modeling and Computer Simulation (TOMACS).
Publications III: Multi-class Applications
• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters, the 13th Annual ACM Symposium on Parallel Algorithms and Architectures.
• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters, Submitted to IEEE Transactions on Parallel and Distributed Systems.
Publications IV: Database
• Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu, H. Franke. Decision-Support Workload Characteristics on a Clustered Database Server from the OS Perspective. Penn State Technical Report CSE-01-003.
Thank You !
I/O Characteristics (Q6)