TRANSCRIPT
Decentralized Resource Management for Multi-core Desktop Grids
Jaehwan Lee, Pete Keleher, Alan Sussman
Department of Computer Science University of Maryland
Multi-core is not enough
• Multi-core CPUs are the current trend in desktop computing
• It is not easy to exploit multi-core in a single machine for high-throughput computing
  "Multicore Is Bad News for Supercomputers", S. Moore, IEEE Spectrum, 2008
• No decentralized solution exists for multi-core grids
How to exploit Multi-core in Peer-to-peer Grids?
Challenges in Multi-core P2P Grids
• Features of structured P2P grids
  For effective matchmaking, a structured P2P platform based on a Distributed Hash Table (DHT) is needed
  A structured DHT is susceptible to frequent dynamic updates of node status
• How to represent a multi-core node in a P2P structure?
  If a distinct logical peer represents each core:
  • Cannot support multi-threaded jobs
  • Cannot accommodate jobs requiring large shared resources
  If a logical peer represents a whole multi-core machine:
  • Contention for shared resources among the cores
  • Can waste some cores due to misled matchmaking
  Needs to advertise dynamic status for residual resources
• Contention for shared resources
  No simple model for a P2P grid
Our Contributions
• Decentralized resource management schemes for multi-core grids
  Two logical nodes for a physical machine
  Dual-CAN & Balloon Model for the P2P structure
• New matchmaking & load balancing scheme
• Simple analytic model for a multi-core node
  Contention for shared resources
Outline
• Background
• Decentralized Resource Management for Multi-core Grids
• Simulation Model
• Experimental Results
• Related Work
• Conclusion & Future Work
P2P Desktop Grid System
[Figure: a P2P desktop grid. Each node contributes CPU, memory, and disk to a Content-Addressable Network (CAN). A job J (CPU 2.0GHz, Mem 500MB, Disk 1GB) is submitted — how to do decentralized matchmaking and load balancing?]
Overall System Architecture
• P2P grids
[Figure: overall system architecture. A client inserts Job J at an injection node, which assigns a GUID to the job and routes it through the peer-to-peer network (DHT - CAN) to its owner node. The owner node initiates matchmaking to find a run node and sends Job J to the run node's FIFO job queue. The run node reports its status back via heartbeats.]
Matchmaking Mechanism in CAN
[Figure: matchmaking in a 2-D CAN with CPU and Memory dimensions, divided into zones owned by nodes A through I. A client inserts Job J, which requires CPU >= CJ && Memory >= MJ. J is routed to the owner node of the point (CJ, MJ), which pushes the job to a run node with at least that capacity; the run node places J in its FIFO queue and sends heartbeats.]
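The routing step above can be sketched as follows. This is an illustrative example (not the authors' code): each node owns a rectangular zone in (CPU, Memory) space, and a job is delivered to the owner of the point given by its requirements (CJ, MJ).

```python
from dataclasses import dataclass

@dataclass
class Zone:
    """A rectangular CAN zone owned by one node (hypothetical bounds)."""
    name: str
    cpu_lo: float
    cpu_hi: float   # GHz
    mem_lo: float
    mem_hi: float   # GB

    def contains(self, cpu, mem):
        return self.cpu_lo <= cpu < self.cpu_hi and self.mem_lo <= mem < self.mem_hi

def owner_of(job_cpu, job_mem, zones):
    """Return the node whose zone contains the job's requirement point."""
    for z in zones:
        if z.contains(job_cpu, job_mem):
            return z.name
    return None

zones = [Zone("A", 0.0, 1.5, 0.0, 2.0), Zone("B", 1.5, 3.0, 0.0, 2.0),
         Zone("C", 0.0, 1.5, 2.0, 4.0), Zone("D", 1.5, 3.0, 2.0, 4.0)]

# Job J requiring CPU >= 2.0GHz and Memory >= 1.0GB is routed to the
# owner of the point (2.0, 1.0).
print(owner_of(2.0, 1.0, zones))   # "B"
```

In a real CAN the lookup is done hop by hop through neighboring zones rather than by scanning a global list; the linear scan here only shows the zone-to-owner mapping.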
Outline
• Background
• Decentralized Resource Management for Multi-core Grids
• Simulation Model
• Experimental Results
• Related Work
• Conclusion & Future Work
Two logical nodes
• Max-node: the maximum values for each resource
  A static point in the CAN (like the single-core case)
• Residue-node: the currently available resources
  Dynamic usage status for the node
  Always a free node
  If a node is free (has no job in the queue) or totally busy (all cores are running jobs), the Residue-node does not exist: there are far fewer Residue-nodes than Max-nodes
• Example: a quad-core node (CPU: 2GHz, Mem: 4GB) accepting four jobs in turn

  State                       Max-node (static)            Residue-node (dynamic)
  Free                        CPU 2GHz, Mem 4GB, 4 CPUs    CPU 2GHz, Mem 4GB, 4 free CPUs (free node)
  + Job 1 (1.2GHz, 0.7GB)     CPU 2GHz, Mem 4GB, 4 CPUs    CPU 2GHz, Mem 3.3GB, 3 free CPUs
  + Job 2 (1.5GHz, 1.2GB)     CPU 2GHz, Mem 4GB, 4 CPUs    CPU 2GHz, Mem 2.1GB, 2 free CPUs
  + Job 3 (2GHz, 1.5GB)       CPU 2GHz, Mem 4GB, 4 CPUs    CPU 2GHz, Mem 0.6GB, 1 free CPU
  + Job 4 (1.5GHz, 0.2GB)     CPU 2GHz, Mem 4GB, 4 CPUs    (totally busy — Residue-node does not exist)
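The example above can be expressed as a small function. This is a minimal sketch with our own naming, not the authors' code: it derives the Residue-node from the Max-node and the jobs currently occupying cores, returning nothing when the node is free or totally busy.

```python
def residue_node(max_cpu_ghz, max_mem_gb, n_cores, jobs):
    """jobs: list of (cpu_ghz, mem_gb) requirements of running jobs.
    Returns (cpu_ghz, free_mem_gb, free_cores), or None when the node is
    free or totally busy (per the talk, no Residue-node exists then)."""
    free_cores = n_cores - len(jobs)
    if len(jobs) == 0 or free_cores == 0:
        return None
    # CPU coordinate stays at the per-core clock rate; memory is what remains
    free_mem = max_mem_gb - sum(mem for _, mem in jobs)
    return (max_cpu_ghz, round(free_mem, 1), free_cores)

# Jobs 1 and 2 from the quad-core example (2GHz, 4GB, 4 cores)
print(residue_node(2.0, 4.0, 4, [(1.2, 0.7), (1.5, 1.2)]))   # (2.0, 2.1, 2)
```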
Dual-CAN
• Primary CAN: composed of Max-nodes
  The same as the original CAN in the single-core case (static)
• Secondary CAN: composed of Residue-nodes
  Fewer nodes in the Secondary CAN (dynamic)
• Example: a single-core node A (1.5GHz, 2GB) and a dual-core node B (2GHz, 3GB)
[Figure: the Primary CAN holds Max-nodes A (1.5GHz, 2GB) and B (2GHz, 3GB) at static coordinates. B is running Job 1 (CPU 2GHz, Mem 2GB; Qlen=1), so its free Residue-node B' (2GHz, 1GB) joins the Secondary CAN.]
Balloon Model
• Balloons: a light-weight structure for Residue-nodes
  Only keeps the coordinates (currently available resources) and load information
  Attached to a zone in the (Primary) CAN
  No CAN join & leave, and no exchange of updates, is necessary
• Example: a single-core node A (1.5GHz, 2GB) and a dual-core node B (2GHz, 3GB)
[Figure: a single CAN holds Max-nodes A (1.5GHz, 2GB) and B (2GHz, 3GB). B is running Job 1 (CPU 2GHz, Mem 2GB; Qlen=1); its free Residue-node B' (2GHz, 1GB) is attached as a balloon to the zone containing its coordinates.]
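A balloon can be sketched as a plain record refreshed by a one-way heartbeat. The class and field names below are our own invention, following the slide: the zone owner just stores or drops the record, with no CAN join/leave involved.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Balloon:
    """Only the Residue-node's coordinates and load — nothing else."""
    node_id: str
    cpu_ghz: float
    free_mem_gb: float
    free_cores: int

@dataclass
class ZoneState:
    owner: str
    balloons: dict = field(default_factory=dict)   # node_id -> Balloon

    def on_heartbeat(self, node_id: str, b: Optional[Balloon]):
        """One-way update: attach/refresh a balloon, or drop it when the
        sender has no Residue-node (free or totally busy)."""
        if b is None:
            self.balloons.pop(node_id, None)
        else:
            self.balloons[node_id] = b

zone = ZoneState(owner="A")
zone.on_heartbeat("B", Balloon("B", 2.0, 1.0, 1))   # B' from the example
print(sorted(zone.balloons))                        # ['B']
zone.on_heartbeat("B", None)                        # B became totally busy
print(sorted(zone.balloons))                        # []
```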
Computing Aggregated Load
[Figure: a 2-D CAN (CPU and Memory dimensions) with zones owned by nodes A through E. Load information is aggregated along each dimension: the number of nodes, the number of balloons & cores, and the sum of used cores.]
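The aggregation along a dimension can be sketched as a simple fold over per-zone reports. The tuple layout and field names here are our own; the slide only specifies which quantities are aggregated (nodes, balloons & cores, sum of used cores).

```python
def aggregate(reports):
    """reports: per-zone tuples (nodes, cores, used_cores, balloons)
    collected along one CAN dimension."""
    nodes = sum(r[0] for r in reports)
    cores = sum(r[1] for r in reports)
    used = sum(r[2] for r in reports)
    balloons = sum(r[3] for r in reports)
    avg_util = used / cores if cores else 0.0
    return {"nodes": nodes, "cores": cores, "used_cores": used,
            "balloons": balloons, "avg_core_utilization": avg_util}

# Hypothetical zones A..E along the CPU dimension
reports = [(1, 4, 2, 1), (1, 2, 2, 0), (1, 8, 3, 1), (1, 1, 0, 0), (1, 4, 4, 0)]
agg = aggregate(reports)
print(agg["used_cores"], agg["cores"])   # 11 19
```

The aggregated average core utilization (used_cores / cores) is what the pushing algorithm on the next slide compares across dimensions.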
Decision Algorithms - Pushing
• Target node (Where?)
  Smaller aggregated average core utilization and larger available number of cores
• Stopping criteria (When?)
  Found a free node, or probabilistic stopping
• Criteria for the best run node (Which?)
  Among the free nodes: the node with the fastest CPU
  If no free nodes: the fastest balloon or node in the Secondary CAN, using a score function that prefers lower core utilization and a faster CPU
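The three decisions can be sketched as below. The exact score function and stopping probability are not given on this slide, so the formulas here are simple stand-ins that match the stated preferences (lower core utilization, faster CPU; stop on a free node or probabilistically).

```python
import random

def pick_run_node(candidates):
    """'Which?': candidates are dicts with 'cpu_ghz', 'used_cores', 'cores'."""
    free = [c for c in candidates if c["used_cores"] == 0]
    if free:
        # among the free nodes: the one with the fastest CPU
        return max(free, key=lambda c: c["cpu_ghz"])
    # otherwise score balloons / Secondary-CAN nodes; stand-in score that
    # prefers lower core utilization and a faster CPU
    def score(c):
        util = c["used_cores"] / c["cores"]
        return (1.0 - util) * c["cpu_ghz"]
    return max(candidates, key=score)

def should_stop(found_free, p_stop=0.3):
    """'When?': stop on a free node, else stop probabilistically."""
    return found_free or random.random() < p_stop

nodes = [{"cpu_ghz": 2.0, "used_cores": 1, "cores": 2},
         {"cpu_ghz": 1.5, "used_cores": 0, "cores": 1}]
print(pick_run_node(nodes)["cpu_ghz"])   # 1.5 — the only free node wins
```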
Pushing : Dual-CAN
[Figure: a client inserts Job J (CPU CJ, Mem MJ) at owner node O in the Primary CAN. O aggregates load information along the CPU and Memory dimensions and pushes J toward less-loaded zones (A, B, C, D) until the stopping criteria are met; the node where pushing stops becomes the run node.]
Pushing : Balloon Model
[Figure: a client inserts Job J (CPU CJ, Mem MJ) at owner node O in the CAN. O aggregates load information along the CPU and Memory dimensions and pushes J across zones (A, B, C, D, S) until the stopping criteria are met; here pushing stops at a balloon, which becomes the run node.]
Model comparison

                            Dual-CAN                                 Balloon Model
  Matchmaking performance   Better than Balloon (searching for       Searching for Residue-nodes (balloons)
                            Residue-nodes is global in the           is limited to one- or two-hop neighbors
                            Secondary CAN)
  Overheads                 CAN maintenance overhead for the         Less than Dual-CAN (simple operations
                            Secondary CAN is needed (proportional    for balloon join & leave; balloon
                            to the number of nodes in it)            update is a one-way heartbeat)
Outline
• Background
• Decentralized Resource Management for Multi-core Grids
• Simulation Model
• Experimental Results
• Related Work
• Conclusion & Future Work
Contention for shared resources (the worst case)
• Contention for shared resources (memory, I/O) can worsen overall performance on a multi-core CPU
  If jobs are extremely memory-intensive, performance can drop drastically
• What is STREAM?
  A benchmark that measures memory bandwidth
  Generates extremely memory-intensive jobs (copy, scale, add and triad)
• Experiments
  (1) Run one memory-intensive job (STREAM) on a dual-core CPU (leave the other core idle)
  (2) Run two memory-intensive jobs on a dual-core CPU simultaneously
  Compare the running times of (1) & (2)
  On average, the running time of (2) is 2.09 times that of (1)
Effect of contention with general scientific computing jobs
• Alam et al.'s experiment
  Ran several scientific computing benchmarks (NAS, AMBER, LAMMPS, POP) on a dual-core machine
  Compared running one task on a dual-core with running two tasks on a dual-core
  Running time for two tasks is higher by 3.8% to 27% (average: 10.97%)
• SPEC CPU2006 experiment
  The same experiment with the SPEC CPU2006 benchmark suite on a dual-core machine
  The running-time increase is 6% (with the gcc compiler) and 10% (with the icc compiler) on average
Our Simulation Model
• Assumption
  A job requiring more memory is likely to be more memory-intensive
• For the worst case
  Running time can increase by n times (n: the number of cores)
• For the general case
  Running time can increase by p% (p = 10, from the previous experiments)
• Contention penalty:

  Ω = (Ci / Ri) × n

  α: running time increase (ratio)
  n: the number of cores
  p: contention penalty from the experimental results
  Ri: amount of resource i in the node
  Ci: sum of the job requirements for resource i
  Ω: contention penalty

[Figure: the running time ratio α plotted against Ω, rising from 1 (no contention) through 1 + p (the general case) to n (the worst case).]
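The contention model can be made concrete as below. The Ω formula follows the slide; the exact shape of α between the anchor points 1, 1 + p, and n is not stated in the transcript, so the clamping used here is our assumption.

```python
def contention_penalty(C_i, R_i, n):
    """Ω = (Ci / Ri) × n: total job demand for resource i relative to the
    node's capacity, scaled by the number of cores."""
    return (C_i / R_i) * n

def running_time_ratio(omega, n, p=0.10):
    """α as a function of Ω. Assumed shape: no slowdown while demand fits
    (Ω <= 1); otherwise at least the measured general-case penalty 1 + p,
    growing with Ω but capped at the worst case n."""
    if omega <= 1.0:
        return 1.0
    return min(float(n), max(1.0 + p, omega))

# Two fully memory-bound jobs on a dual-core node (Ci == Ri per job):
omega = contention_penalty(C_i=4.0, R_i=4.0, n=2)
print(running_time_ratio(omega, n=2))   # 2.0 — the n-times worst case
```

With p = 0.10, a lightly oversubscribed node (Ω just above 1) gets the ~10% general-case penalty measured with SPEC CPU2006, while full memory saturation reproduces the STREAM-style worst case.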
Outline
• Background
• Decentralized Resource Management for Multi-core Grids
• Simulation Model
• Experimental Results
• Related Work
• Conclusion & Future Work
Experimental Setup
• Event-driven simulations
  A set of nodes and events
  • 1000 initial nodes and 5000 job submissions
  • Jobs are submitted with a Poisson distribution
  • A node has 1, 2, 4 or 8 cores
  • Job run time follows a uniform distribution (30 mins ~ 90 mins)
• Node capability (job requirements)
  • CPU, memory, disk and the number of cores
• Steady-state experiments
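The workload generation described above can be sketched in a few lines (our own code, with an assumed arrival rate, since the transcript does not give one): Poisson arrivals correspond to exponential inter-arrival times, and run times are drawn uniformly from 30 to 90 minutes.

```python
import random

random.seed(42)   # deterministic for illustration

def make_nodes(n=1000):
    """1000 initial nodes with 1, 2, 4 or 8 cores each."""
    return [{"cores": random.choice([1, 2, 4, 8])} for _ in range(n)]

def make_jobs(n=5000, rate_per_min=2.0):
    """5000 job submissions; Poisson arrivals <=> exponential gaps.
    rate_per_min is an assumed parameter, not from the talk."""
    t, jobs = 0.0, []
    for _ in range(n):
        t += random.expovariate(rate_per_min)
        jobs.append({"arrival": t,
                     "runtime": random.uniform(30.0, 90.0)})   # minutes
    return jobs

nodes, jobs = make_nodes(), make_jobs()
print(len(nodes), len(jobs))   # 1000 5000
```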
Experimental Setup
• Performance metrics
  Job turn-around time = matchmaking cost + wait time + running time
  [Figure: a job's timeline — injected into the system, arrives at the owner, arrives at the run node, starts execution, finishes execution. Matchmaking cost spans injection through arrival at the run node; wait time spans queuing until execution starts; running time spans execution.]
• Matchmaking frameworks
  CAN (Dual-CAN & Balloon Model)
  Multiple Peers (MP) and Centralized Matchmaker (CENT)
Comparison Models
• Centralized Matchmaker (CENT)
  An online, global scheduling mechanism
  Not feasible in a complete implementation of a P2P system
• Multiple Peers (MP)
  An individual peer on each core, with the shared resources divided equally
  Condor's current strategy
[Figure: a client submits Job J (CPU 2.0GHz, Mem 500MB, Disk 1GB) to a centralized matchmaker.]
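The MP strategy can be sketched as follows (our own illustration of the slide's description): each core becomes an independent peer with an equal share of the machine's shared resources, so a job whose demand exceeds one share can never be matched — which is why MP loses completeness in Result (1).

```python
def mp_peers(cpu_ghz, mem_gb, disk_gb, n_cores):
    """One peer per core, with memory and disk divided equally."""
    return [{"cpu_ghz": cpu_ghz,
             "mem_gb": mem_gb / n_cores,
             "disk_gb": disk_gb / n_cores}
            for _ in range(n_cores)]

def can_match(job, peers):
    return any(p["cpu_ghz"] >= job["cpu_ghz"] and p["mem_gb"] >= job["mem_gb"]
               for p in peers)

# Hypothetical quad-core machine: 2GHz cores, 4GB memory, 100GB disk
peers = mp_peers(cpu_ghz=2.0, mem_gb=4.0, disk_gb=100.0, n_cores=4)

big_job = {"cpu_ghz": 1.0, "mem_gb": 2.0}   # fits the whole machine easily
print(can_match(big_job, peers))            # False — each peer has only 1GB
```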
Result (1) – Completeness
• Single-threaded jobs
• Dual-CAN & Balloon Model can run all jobs
• MP: 80% completeness
• For a fair comparison, submit only the jobs that can run on MP to Balloon or Dual-CAN (Balloon-L, Dual-CAN-L)
• Balloon-L, Dual-CAN-L and MP show similar performance
• MP cannot achieve completeness
Cost (1) - Overheads
• Cost: the number of messages and the volume of messages
• MP is more expensive than the two schemes
  Cost is proportional to the number of peers
• The Balloon Model is cheaper than Dual-CAN
Result (2) – Load Balance
• Multi-threaded jobs
• Load balancing performance: Dual-CAN > CENT > Balloon Model
• Why is CENT worse?
  CENT is based on a greedy algorithm (over-provisioning)
Cost (2) - Overhead
• Vanilla: the cost without the additional costs incurred by Balloon or Dual-CAN
• Costs: Dual-CAN > Balloon Model == Vanilla
Evaluation Summary
• Performance
  Completeness: Dual-CAN, Balloon (MP cannot)
  Load balance: Dual-CAN >= Balloon == CENT (competitive load balance)
• Overheads
  MP >> Dual-CAN >= Balloon (the number of peers)
  Dual-CAN >= Balloon == Vanilla (low overhead)
Related Work
• Time-to-Live (TTL) based mechanisms
  Caromel et al. (Parallel Computing, 2007), Mastroianni et al. (EGC, 2005)
  Lack completeness
• Encoding resource information using a DHT
  Cheema et al. (Grid, 2005), CompuP2P (TPDS, 2006)
  Lack load balance and parsimony
• Grids for multi-core desktops
  Condor: static partitioning to handle a multi-core node as a set of independent entities
Conclusion and Future Work
• New decentralized resource management for multi-core P2P grids
  Two logical nodes for static & dynamic features
  Dual-CAN and Balloon Model
• Simple analytic model for multi-core simulation considering resource contention
• Evaluation via simulation
  Completeness (better than Multiple Peers)
  Load balance (competitive with the Centralized Matchmaker)
  Low overhead
• Future work
  Real experiments (in cooperation with the Astronomy Dept.)
  Resource management for heterogeneous multi-processors
Decision Functions
• Target node (Where?)
  Target dimension
  Aggregated load information along the dimension d
  Objective function to minimize
• Stopping criteria (When?)
  Found a free node, OR
  Stopping factor
  Probability to stop pushing from node N
• Criteria for the best run node (Which?)
  Among the free nodes, OR
  Score function for a candidate run node C