TRANSCRIPT
Decentralized Resource Management for Multi-core Desktop Grids
Jaehwan Lee, Pete Keleher, Alan Sussman
Department of Computer Science University of Maryland
Multi-core is not enough
• Multi-core CPUs are the current trend in desktop computing
• It is not easy to exploit multi-core in a single machine for high-throughput computing
  "Multicore Is Bad News for Supercomputers", S. Moore, IEEE Spectrum, 2008
• No decentralized solution exists for multi-core grids
How to exploit Multi-core in Peer-to-peer Grids?
Challenges in Multi-core P2P Grids
• Features of structured P2P grids
  For effective matchmaking, a structured P2P platform based on a Distributed Hash Table (DHT) is needed
  A structured DHT is susceptible to frequent dynamic updates of node status
• How to represent a multi-core node in a P2P structure?
  If a distinct logical peer represents each core:
  • Cannot support multi-threaded jobs
  • Cannot accommodate jobs requiring large shared resources
  If a logical peer represents a whole multi-core machine:
  • Contention for shared resources among the cores
  • Can waste some cores due to misled matchmaking
  Needs to advertise dynamic status for residual resources
• Contention for shared resources
  No simple model for a P2P grid
Our Contributions
• Decentralized resource management schemes for multi-core grids
  Two logical nodes for a physical machine
  Dual-CAN & Balloon Model for the P2P structure
• New matchmaking & load balancing scheme
• Simple analytic model for a multi-core node
  Contention for shared resources
Outline
• Background
• Decentralized Resource Management for Multi-core Grids
• Simulation Model
• Experimental Results
• Related Work
• Conclusion & Future Work
P2P Desktop Grid System
[Figure: a P2P desktop grid. Each node contributes CPU, memory, and disk to a Content-Addressable Network (CAN). A job J (CPU 2.0GHz, Mem 500MB, Disk 1GB) is submitted — how to do decentralized matchmaking and load balancing?]
Overall System Architecture
• P2P grids
[Figure: overall system architecture. A client inserts Job J at an injection node, which assigns a GUID to the job and routes it through the peer-to-peer network (DHT - CAN) to its owner node. The owner node initiates matchmaking to find a run node and sends Job J to the run node's FIFO job queue. The run node reports its status back via heartbeats.]
Matchmaking Mechanism in CAN
[Figure: matchmaking in a 2-D CAN with CPU and Memory dimensions, divided into zones owned by nodes A through I. A client inserts Job J, which requires CPU >= CJ && Memory >= MJ. J is routed to the owner node of the point (CJ, MJ), which pushes the job to a run node with at least that capacity; the run node places J in its FIFO queue and sends heartbeats.]
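The routing step above can be sketched as follows. This is an illustrative example (not the authors' code): each node owns a rectangular zone in (CPU, Memory) space, and a job is delivered to the owner of the point given by its requirements (CJ, MJ).

```python
from dataclasses import dataclass

@dataclass
class Zone:
    """A rectangular CAN zone owned by one node (hypothetical bounds)."""
    name: str
    cpu_lo: float
    cpu_hi: float   # GHz
    mem_lo: float
    mem_hi: float   # GB

    def contains(self, cpu, mem):
        return self.cpu_lo <= cpu < self.cpu_hi and self.mem_lo <= mem < self.mem_hi

def owner_of(job_cpu, job_mem, zones):
    """Return the node whose zone contains the job's requirement point."""
    for z in zones:
        if z.contains(job_cpu, job_mem):
            return z.name
    return None

zones = [Zone("A", 0.0, 1.5, 0.0, 2.0), Zone("B", 1.5, 3.0, 0.0, 2.0),
         Zone("C", 0.0, 1.5, 2.0, 4.0), Zone("D", 1.5, 3.0, 2.0, 4.0)]

# Job J requiring CPU >= 2.0GHz and Memory >= 1.0GB is routed to the
# owner of the point (2.0, 1.0).
print(owner_of(2.0, 1.0, zones))   # "B"
```

In a real CAN the lookup is done hop by hop through neighboring zones rather than by scanning a global list; the linear scan here only shows the zone-to-owner mapping.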
Outline
• Background
• Decentralized Resource Management for Multi-core Grids
• Simulation Model
• Experimental Results
• Related Work
• Conclusion & Future Work
Two logical nodes
• Max-node: the maximum values for each resource
  A static point in the CAN (like the single-core case)
• Residue-node: the currently available resources
  Dynamic usage status for the node
  Always a free node
  If a node is free (has no job in the queue) or totally busy (all cores are running jobs), the Residue-node does not exist: there are far fewer Residue-nodes than Max-nodes
• Example: a quad-core node (CPU: 2GHz, Mem: 4GB) accepting four jobs in turn

  State                       Max-node (static)            Residue-node (dynamic)
  Free                        CPU 2GHz, Mem 4GB, 4 CPUs    CPU 2GHz, Mem 4GB, 4 free CPUs (free node)
  + Job 1 (1.2GHz, 0.7GB)     CPU 2GHz, Mem 4GB, 4 CPUs    CPU 2GHz, Mem 3.3GB, 3 free CPUs
  + Job 2 (1.5GHz, 1.2GB)     CPU 2GHz, Mem 4GB, 4 CPUs    CPU 2GHz, Mem 2.1GB, 2 free CPUs
  + Job 3 (2GHz, 1.5GB)       CPU 2GHz, Mem 4GB, 4 CPUs    CPU 2GHz, Mem 0.6GB, 1 free CPU
  + Job 4 (1.5GHz, 0.2GB)     CPU 2GHz, Mem 4GB, 4 CPUs    (totally busy — Residue-node does not exist)
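The example above can be expressed as a small function. This is a minimal sketch with our own naming, not the authors' code: it derives the Residue-node from the Max-node and the jobs currently occupying cores, returning nothing when the node is free or totally busy.

```python
def residue_node(max_cpu_ghz, max_mem_gb, n_cores, jobs):
    """jobs: list of (cpu_ghz, mem_gb) requirements of running jobs.
    Returns (cpu_ghz, free_mem_gb, free_cores), or None when the node is
    free or totally busy (per the talk, no Residue-node exists then)."""
    free_cores = n_cores - len(jobs)
    if len(jobs) == 0 or free_cores == 0:
        return None
    # CPU coordinate stays at the per-core clock rate; memory is what remains
    free_mem = max_mem_gb - sum(mem for _, mem in jobs)
    return (max_cpu_ghz, round(free_mem, 1), free_cores)

# Jobs 1 and 2 from the quad-core example (2GHz, 4GB, 4 cores)
print(residue_node(2.0, 4.0, 4, [(1.2, 0.7), (1.5, 1.2)]))   # (2.0, 2.1, 2)
```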
Dual-CAN
• Primary CAN: composed of Max-nodes
  The same as the original CAN in the single-core case (static)
• Secondary CAN: composed of Residue-nodes
  Fewer nodes in the Secondary CAN (dynamic)
• Example: a single-core node A (1.5GHz, 2GB) and a dual-core node B (2GHz, 3GB)
[Figure: the Primary CAN holds Max-nodes A (1.5GHz, 2GB) and B (2GHz, 3GB) at static coordinates. B is running Job 1 (CPU 2GHz, Mem 2GB; Qlen=1), so its free Residue-node B' (2GHz, 1GB) joins the Secondary CAN.]
Balloon Model
• Balloons: a light-weight structure for Residue-nodes
  Only keeps the coordinates (currently available resources) and load information
  Attached to a zone in the (Primary) CAN
  No CAN join & leave, and no exchange of updates, is necessary
• Example: a single-core node A (1.5GHz, 2GB) and a dual-core node B (2GHz, 3GB)
[Figure: a single CAN holds Max-nodes A (1.5GHz, 2GB) and B (2GHz, 3GB). B is running Job 1 (CPU 2GHz, Mem 2GB; Qlen=1); its free Residue-node B' (2GHz, 1GB) is attached as a balloon to the zone containing its coordinates.]
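A balloon can be sketched as a plain record refreshed by a one-way heartbeat. The class and field names below are our own invention, following the slide: the zone owner just stores or drops the record, with no CAN join/leave involved.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Balloon:
    """Only the Residue-node's coordinates and load — nothing else."""
    node_id: str
    cpu_ghz: float
    free_mem_gb: float
    free_cores: int

@dataclass
class ZoneState:
    owner: str
    balloons: dict = field(default_factory=dict)   # node_id -> Balloon

    def on_heartbeat(self, node_id: str, b: Optional[Balloon]):
        """One-way update: attach/refresh a balloon, or drop it when the
        sender has no Residue-node (free or totally busy)."""
        if b is None:
            self.balloons.pop(node_id, None)
        else:
            self.balloons[node_id] = b

zone = ZoneState(owner="A")
zone.on_heartbeat("B", Balloon("B", 2.0, 1.0, 1))   # B' from the example
print(sorted(zone.balloons))                        # ['B']
zone.on_heartbeat("B", None)                        # B became totally busy
print(sorted(zone.balloons))                        # []
```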
Computing Aggregated Load
[Figure: a 2-D CAN (CPU and Memory dimensions) with zones owned by nodes A through E. Load information is aggregated along each dimension: the number of nodes, the number of balloons & cores, and the sum of used cores.]
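The aggregation along a dimension can be sketched as a simple fold over per-zone reports. The tuple layout and field names here are our own; the slide only specifies which quantities are aggregated (nodes, balloons & cores, sum of used cores).

```python
def aggregate(reports):
    """reports: per-zone tuples (nodes, cores, used_cores, balloons)
    collected along one CAN dimension."""
    nodes = sum(r[0] for r in reports)
    cores = sum(r[1] for r in reports)
    used = sum(r[2] for r in reports)
    balloons = sum(r[3] for r in reports)
    avg_util = used / cores if cores else 0.0
    return {"nodes": nodes, "cores": cores, "used_cores": used,
            "balloons": balloons, "avg_core_utilization": avg_util}

# Hypothetical zones A..E along the CPU dimension
reports = [(1, 4, 2, 1), (1, 2, 2, 0), (1, 8, 3, 1), (1, 1, 0, 0), (1, 4, 4, 0)]
agg = aggregate(reports)
print(agg["used_cores"], agg["cores"])   # 11 19
```

The aggregated average core utilization (used_cores / cores) is what the pushing algorithm on the next slide compares across dimensions.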
Decision Algorithms - Pushing
• Target node (Where?)
  Smaller aggregated average core utilization and larger available number of cores
• Stopping criteria (When?)
  Found a free node, or probabilistic stopping
• Criteria for the best run node (Which?)
  Among the free nodes: the node with the fastest CPU
  If no free nodes: the fastest balloon or node in the Secondary CAN, using a score function that prefers lower core utilization and a faster CPU
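The three decisions can be sketched as below. The exact score function and stopping probability are not given on this slide, so the formulas here are simple stand-ins that match the stated preferences (lower core utilization, faster CPU; stop on a free node or probabilistically).

```python
import random

def pick_run_node(candidates):
    """'Which?': candidates are dicts with 'cpu_ghz', 'used_cores', 'cores'."""
    free = [c for c in candidates if c["used_cores"] == 0]
    if free:
        # among the free nodes: the one with the fastest CPU
        return max(free, key=lambda c: c["cpu_ghz"])
    # otherwise score balloons / Secondary-CAN nodes; stand-in score that
    # prefers lower core utilization and a faster CPU
    def score(c):
        util = c["used_cores"] / c["cores"]
        return (1.0 - util) * c["cpu_ghz"]
    return max(candidates, key=score)

def should_stop(found_free, p_stop=0.3):
    """'When?': stop on a free node, else stop probabilistically."""
    return found_free or random.random() < p_stop

nodes = [{"cpu_ghz": 2.0, "used_cores": 1, "cores": 2},
         {"cpu_ghz": 1.5, "used_cores": 0, "cores": 1}]
print(pick_run_node(nodes)["cpu_ghz"])   # 1.5 — the only free node wins
```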
Pushing : Dual-CAN
[Figure: a client inserts Job J (CPU CJ, Mem MJ) at owner node O in the Primary CAN. O aggregates load information along the CPU and Memory dimensions and pushes J toward less-loaded zones (A, B, C, D) until the stopping criteria are met; the node where pushing stops becomes the run node.]
Pushing : Balloon Model
[Figure: a client inserts Job J (CPU CJ, Mem MJ) at owner node O in the CAN. O aggregates load information along the CPU and Memory dimensions and pushes J across zones (A, B, C, D, S) until the stopping criteria are met; here pushing stops at a balloon, which becomes the run node.]
Model comparison

                            Dual-CAN                                 Balloon Model
  Matchmaking performance   Better than Balloon (searching for       Searching for Residue-nodes (balloons)
                            Residue-nodes is global in the           is limited to one- or two-hop neighbors
                            Secondary CAN)
  Overheads                 CAN maintenance overhead for the         Less than Dual-CAN (simple operations
                            Secondary CAN is needed (proportional    for balloon join & leave; balloon
                            to the number of nodes in it)            update is a one-way heartbeat)
Outline
• Background
• Decentralized Resource Management for Multi-core Grids
• Simulation Model
• Experimental Results
• Related Work
• Conclusion & Future Work
Contention for shared resources (the worst case)
• Contention for shared resources (memory, I/O) can worsen overall performance on a multi-core CPU
  If jobs are extremely memory-intensive, performance can drop drastically
• What is STREAM?
  A benchmark that measures memory bandwidth
  Generates extremely memory-intensive jobs (copy, scale, add and triad)
• Experiments
  (1) Run one memory-intensive job (STREAM) on a dual-core CPU (leave the other core idle)
  (2) Run two memory-intensive jobs on a dual-core CPU simultaneously
  Compare the running times of (1) & (2)
  On average, the running time of (2) is 2.09 times that of (1)
Effect of contention with general scientific computing jobs
• Alam et al.'s experiment
  Ran several scientific computing benchmarks (NAS, AMBER, LAMMPS, POP) on a dual-core machine
  Compared running one task on a dual-core with running two tasks on a dual-core
  Running time for two tasks is higher by 3.8% to 27% (average: 10.97%)
• SPEC CPU2006 experiment
  The same experiment with the SPEC CPU2006 benchmark suite on a dual-core machine
  The running-time increase is 6% (with the gcc compiler) and 10% (with the icc compiler) on average
Our Simulation Model
• Assumption
  A job requiring more memory is likely to be more memory-intensive
• For the worst case
  Running time can increase by n times (n: the number of cores)
• For the general case
  Running time can increase by p% (p = 10, from the previous experiments)
• Contention penalty:

  Ω = (Ci / Ri) × n

  α: running time increase (ratio)
  n: the number of cores
  p: contention penalty from the experimental results
  Ri: amount of resource i in the node
  Ci: sum of the job requirements for resource i
  Ω: contention penalty

[Figure: the running time ratio α plotted against Ω, rising from 1 (no contention) through 1 + p (the general case) to n (the worst case).]
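The contention model can be made concrete as below. The Ω formula follows the slide; the exact shape of α between the anchor points 1, 1 + p, and n is not stated in the transcript, so the clamping used here is our assumption.

```python
def contention_penalty(C_i, R_i, n):
    """Ω = (Ci / Ri) × n: total job demand for resource i relative to the
    node's capacity, scaled by the number of cores."""
    return (C_i / R_i) * n

def running_time_ratio(omega, n, p=0.10):
    """α as a function of Ω. Assumed shape: no slowdown while demand fits
    (Ω <= 1); otherwise at least the measured general-case penalty 1 + p,
    growing with Ω but capped at the worst case n."""
    if omega <= 1.0:
        return 1.0
    return min(float(n), max(1.0 + p, omega))

# Two fully memory-bound jobs on a dual-core node (Ci == Ri per job):
omega = contention_penalty(C_i=4.0, R_i=4.0, n=2)
print(running_time_ratio(omega, n=2))   # 2.0 — the n-times worst case
```

With p = 0.10, a lightly oversubscribed node (Ω just above 1) gets the ~10% general-case penalty measured with SPEC CPU2006, while full memory saturation reproduces the STREAM-style worst case.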
Outline
• Background
• Decentralized Resource Management for Multi-core Grids
• Simulation Model
• Experimental Results
• Related Work
• Conclusion & Future Work
Experimental Setup
• Event-driven simulations
  A set of nodes and events
  • 1000 initial nodes and 5000 job submissions
  • Jobs are submitted with a Poisson distribution
  • A node has 1, 2, 4 or 8 cores
  • Job run time follows a uniform distribution (30 mins ~ 90 mins)
• Node capability (job requirements)
  • CPU, memory, disk and the number of cores
• Steady-state experiments
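The workload generation described above can be sketched in a few lines (our own code, with an assumed arrival rate, since the transcript does not give one): Poisson arrivals correspond to exponential inter-arrival times, and run times are drawn uniformly from 30 to 90 minutes.

```python
import random

random.seed(42)   # deterministic for illustration

def make_nodes(n=1000):
    """1000 initial nodes with 1, 2, 4 or 8 cores each."""
    return [{"cores": random.choice([1, 2, 4, 8])} for _ in range(n)]

def make_jobs(n=5000, rate_per_min=2.0):
    """5000 job submissions; Poisson arrivals <=> exponential gaps.
    rate_per_min is an assumed parameter, not from the talk."""
    t, jobs = 0.0, []
    for _ in range(n):
        t += random.expovariate(rate_per_min)
        jobs.append({"arrival": t,
                     "runtime": random.uniform(30.0, 90.0)})   # minutes
    return jobs

nodes, jobs = make_nodes(), make_jobs()
print(len(nodes), len(jobs))   # 1000 5000
```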
Experimental Setup
• Performance metrics
  Job turn-around time = matchmaking cost + wait time + running time
  [Figure: a job's timeline — injected into the system, arrives at the owner, arrives at the run node, starts execution, finishes execution. Matchmaking cost spans injection through arrival at the run node; wait time spans queuing until execution starts; running time spans execution.]
• Matchmaking frameworks
  CAN (Dual-CAN & Balloon Model)
  Multiple Peers (MP) and Centralized Matchmaker (CENT)
Comparison Models
• Centralized Matchmaker (CENT)
  An online, global scheduling mechanism
  Not feasible in a complete implementation of a P2P system
• Multiple Peers (MP)
  An individual peer on each core, with the shared resources divided equally
  Condor's current strategy
[Figure: a client submits Job J (CPU 2.0GHz, Mem 500MB, Disk 1GB) to a centralized matchmaker.]
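The MP strategy can be sketched as follows (our own illustration of the slide's description): each core becomes an independent peer with an equal share of the machine's shared resources, so a job whose demand exceeds one share can never be matched — which is why MP loses completeness in Result (1).

```python
def mp_peers(cpu_ghz, mem_gb, disk_gb, n_cores):
    """One peer per core, with memory and disk divided equally."""
    return [{"cpu_ghz": cpu_ghz,
             "mem_gb": mem_gb / n_cores,
             "disk_gb": disk_gb / n_cores}
            for _ in range(n_cores)]

def can_match(job, peers):
    return any(p["cpu_ghz"] >= job["cpu_ghz"] and p["mem_gb"] >= job["mem_gb"]
               for p in peers)

# Hypothetical quad-core machine: 2GHz cores, 4GB memory, 100GB disk
peers = mp_peers(cpu_ghz=2.0, mem_gb=4.0, disk_gb=100.0, n_cores=4)

big_job = {"cpu_ghz": 1.0, "mem_gb": 2.0}   # fits the whole machine easily
print(can_match(big_job, peers))            # False — each peer has only 1GB
```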
Result (1) – Completeness
• Single-threaded jobs
• Dual-CAN & Balloon Model can run all jobs
• MP: 80% completeness
• For a fair comparison, submit only the jobs that can run on MP to Balloon or Dual-CAN (Balloon-L, Dual-CAN-L)
• Balloon-L, Dual-CAN-L and MP show similar performance
• MP cannot achieve completeness
Cost (1) - Overheads
• Cost: the number of messages and the volume of messages
• MP is more expensive than the two schemes
  Cost is proportional to the number of peers
• The Balloon Model is cheaper than Dual-CAN
Result (2) – Load Balance
• Multi-threaded jobs
• Load balancing performance: Dual-CAN > CENT > Balloon Model
• Why is CENT worse?
  CENT is based on a greedy algorithm (over-provisioning)
Cost (2) - Overhead
• Vanilla: the cost without the additional costs incurred by Balloon or Dual-CAN
• Costs: Dual-CAN > Balloon Model == Vanilla
Evaluation Summary
• Performance
  Completeness: Dual-CAN, Balloon (MP cannot)
  Load balance: Dual-CAN >= Balloon == CENT (competitive load balance)
• Overheads
  MP >> Dual-CAN >= Balloon (the number of peers)
  Dual-CAN >= Balloon == Vanilla (low overhead)
Related Work
• Time-to-Live (TTL) based mechanisms
  Caromel et al. (Parallel Computing, 2007), Mastroianni et al. (EGC, 2005)
  Lack completeness
• Encoding resource information using a DHT
  Cheema et al. (Grid, 2005), CompuP2P (TPDS, 2006)
  Lack load balance and parsimony
• Grids for multi-core desktops
  Condor: static partitioning to handle a multi-core node as a set of independent entities
Conclusion and Future Work
• New decentralized resource management for multi-core P2P grids
  Two logical nodes for static & dynamic features
  Dual-CAN and Balloon Model
• Simple analytic model for multi-core simulation considering resource contention
• Evaluation via simulation
  Completeness (better than Multiple Peers)
  Load balance (competitive with the Centralized Matchmaker)
  Low overhead
• Future work
  Real experiments (in cooperation with the Astronomy Dept.)
  Resource management for heterogeneous multi-processors
Decision Functions
• Target node (Where?)
  Target dimension
  Aggregated load information along the dimension d
  Objective function to minimize
• Stopping criteria (When?)
  Found a free node, OR
  Stopping factor
  Probability to stop pushing from node N
• Criteria for the best run node (Which?)
  Among the free nodes, OR
  Score function for a candidate run node C