Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications


Page 1: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

CCGrid2001, Brisbane, 5/18/01

Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Sajal K. Das and Daniel J. Harvey
Department of Computer Science and Engineering
The University of Texas at Arlington
E-mail: {das,harvey}@cse.uta.edu

Rupak Biswas
NASA Ames Research Center
E-mail: [email protected]

Page 2: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Presentation Overview

• The Information Power Grid (IPG)
• Motivations
• Load Balancing and Partitioning
• Our Contributions
• The new MinEX Partitioner
• Experimental Study
• Performance Results
• Conclusions and Ongoing Research

Page 3: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

The Information Power Grid (IPG)

• Harness the power of geographically separated resources
• Developed by NASA and other collaborative partners
• Utilize a distributed environment to solve large-scale computational problems
• Additional relevant applications identified by the I-Way experiment:
  – Remote access to large databases with high-end graphics facilities
  – Remote virtual reality access to instruments
  – Remote interactions with supercomputer simulations

Page 4: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Motivations

• Develop techniques to enhance the feasibility of running applications on the IPG
  – An effective load balancer/partitioner for a distributed environment
  – Allowance for latency tolerance to overcome low bandwidths
• Predict application performance by simulation of the IPG

Page 5: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Load Balancing and Partitioning

GOAL: Distribute workload evenly among processors

• Static load balancers
  – Balance load prior to execution
  – Examples: smart compilers, schedulers
• Dynamic load balancers
  – Balance load as the application is processed
  – Examples: adaptive contracting, gradient, symmetric broadcast networks
• Semi-dynamic load balancers
  – Temporarily stop processing to balance workload
  – Utilize a partitioning technique
  – Examples: MeTiS, Jostle, PLUM

Page 6: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Our Contributions

• Limitations of existing partitioners:
  – Separate partitioning and data redistribution steps
  – Lack of latency tolerance
  – Balance loads with excessive communication and data movement
• Propose a new partitioner (MinEX) for the IPG environment that:
  – Minimizes total runtime rather than balancing workload
  – Compensates for high latency on the IPG
  – Is compared with existing methods

Page 7: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

The MinEX Partitioner

• Diffusive algorithm with the goal of minimizing total runtime
• User-supplied function for latency tolerance
• Accounts for data redistribution cost during partitioning
• Collapses pairs of vertices incrementally
• Partitions the contracted graph
• Refines the graph gradually back to the original in reverse order
• Vertex reassignment is considered at each refinement

Page 8: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Metrics Utilized

• Processing Weight: Wgt_v = PWgt_v x Proc_c
• Communication Cost: Comm_v = Sum over edges (v,w) of CWgt_(v,w) x Connect(c_p, c_q)
• Redistribution Cost: Remap_v = RWgt_v x Connect(c_p, c_q) if p != q (zero if v is not moved)
• Weighted Queue Length: QWgt(p) = Sum over vertices v assigned to p of (Wgt_v + Comm_v + Remap_v)
• Heaviest load: MaxQWgt
• Lightest load: MinQWgt
• Average load: AvgQWgt
• Total system load: QWgtToT = Sum over processors p of QWgt(p)
• Load Imbalance Factor: LoadImb = MaxQWgt / AvgQWgt
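To make these definitions concrete, here is a minimal C sketch with invented queue weights for four processors (none of these numbers come from the slides) that computes MaxQWgt, MinQWgt, AvgQWgt, QWgtToT, and LoadImb:

    #include <stdio.h>

    #define P 4  /* number of processors -- an example value */

    int main(void)
    {
        /* Hypothetical QWgt(p) values: each would be the sum, over the
           vertices assigned to p, of Wgt_v + Comm_v + Remap_v. */
        double qwgt[P] = {120.0, 95.0, 140.0, 105.0};

        double maxq = qwgt[0], minq = qwgt[0], qwgtToT = 0.0;
        for (int p = 0; p < P; p++) {
            if (qwgt[p] > maxq) maxq = qwgt[p];
            if (qwgt[p] < minq) minq = qwgt[p];
            qwgtToT += qwgt[p];               /* total system load */
        }
        double avgq = qwgtToT / P;            /* average load */
        printf("MaxQWgt=%.1f MinQWgt=%.1f AvgQWgt=%.1f LoadImb=%.3f\n",
               maxq, minq, avgq, maxq / avgq);
        return 0;
    }

For these invented numbers LoadImb = 140/115, roughly 1.217, i.e. the heaviest processor carries about 22% more load than the average.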

Page 9: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

MinVar, Gain, and ThroTTle

• Processor workload variance from MinQWgt:
  – MinVar = Sum over processors p of (QWgt(p) - MinQWgt)^2
  – ΔMinVar reflects the improvement in MinVar after a vertex reassignment
• Gain is the change (ΔQWgtToT) in total system load resulting from a vertex reassignment
• ThroTTle is a user-defined parameter
  – Vertex moves that improve MinVar are allowed if Gain/ThroTTle <= ΔMinVar
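A minimal sketch of the acceptance test (move_allowed is our own illustrative helper; Gain and ΔMinVar would be computed from the metrics on the previous slide):

    #include <stdio.h>

    /* Sketch: a move must improve MinVar, and its Gain (growth in total
       system load) is limited by the user-defined ThroTTle parameter. */
    static int move_allowed(double gain, double deltaMinVar, double throttle)
    {
        return deltaMinVar > 0.0 && gain / throttle <= deltaMinVar;
    }

    int main(void)
    {
        /* Invented numbers: Gain = 12, DeltaMinVar = 300, ThroTTle = 30.
           12/30 = 0.4 <= 300, so the move is allowed. */
        printf("%s\n", move_allowed(12.0, 300.0, 30.0) ? "allowed" : "rejected");
        return 0;
    }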

Page 10: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

MinEX Data Structures

• Mesh: {|V|, |E|, vTot, *VMap, *VList, *EList}
  – |V|    : number of active vertices
  – |E|    : total number of edges
  – vTot   : total number of vertices
  – *VMap  : pointer to the list of active vertices
  – *VList : pointer to the complete list of vertices
  – *EList : pointer to the list of edges
• Each EList entry contains {w, CWgt_(v,w)}, where w is the adjacent vertex and CWgt_(v,w) is the edge communication weight
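One plausible C rendering of the Mesh record and EList entries (field names follow the slide; the concrete types are our assumptions):

    /* Sketch of the edge and top-level mesh records; types assumed. */
    typedef struct {
        int    w;      /* adjacent vertex                      */
        double CWgt;   /* edge communication weight CWgt(v,w)  */
    } Edge;

    typedef struct Vertex Vertex;   /* per-vertex record, next slide */

    typedef struct {
        int     nActiveV;  /* |V| : number of active vertices */
        int     nEdges;    /* |E| : total number of edges     */
        int     vTot;      /* total number of vertices        */
        int    *VMap;      /* list of active vertices         */
        Vertex *VList;     /* complete list of vertices       */
        Edge   *EList;     /* list of edges                   */
    } Mesh;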

Page 11: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

MinEX Data Structures (continued)

• VList (for each vertex v): {PWgt, RWgt, |e|, *e, merge, lookup, *VMap, *heap, border}
  – PWgt   : computational weight
  – RWgt   : redistribution weight
  – |e|    : number of incident edges
  – *e     : pointer to the first edge
  – merge  : vertex that merged with v (or -1)
  – lookup : active vertex containing v (or -1)
  – *VMap  : pointer to v's position in VMap
  – *heap  : pointer to the heap entry for v
  – border : indicates whether v is a border vertex
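Continuing the sketch above, the per-vertex VList record might look as follows (field names from the slide, types assumed):

    /* Sketch of the per-vertex record referenced by Mesh.VList. */
    struct Vertex {
        double PWgt;    /* computational weight                  */
        double RWgt;    /* redistribution weight                 */
        int    nEdges;  /* |e| : number of incident edges        */
        Edge  *e;       /* pointer to the first incident edge    */
        int    merge;   /* vertex that merged with v, or -1      */
        int    lookup;  /* active vertex containing v, or -1     */
        int   *VMap;    /* pointer to v's position in VMap       */
        void  *heap;    /* pointer to v's entry in the Gain heap */
        int    border;  /* nonzero if v is a border vertex       */
    };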

Page 12: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

MinEX Contraction Phase

[Figure: contraction example. A graph of seven vertices A-G with redistribution weights R1-R4 and edge communication weights C2 and C8; VMap = {A,B,C,D,E,F,G}, |V| = 7, |E| = 16, with an empty stack. After the edge between C and F is collapsed into meta-vertex H (weight R3), VMap = {A,B,H,D,E,G}, |V| = 6, and the pair C,F is pushed onto the stack.]

• Form meta-vertices by collapsing edges
• Collapse the edge maximizing CWgt_(v,w) / (RWgt_v + RWgt_w)

Procedure Find(v), rendered here as C, resolves a vertex to the active meta-vertex that now contains it and caches the result in lookup:

    /* Assumes the global Vertex VList[] array and vTot sketched above. */
    int Find(int v)
    {
        if (VList[v].merge == -1) return v;   /* v is still active */
        if (VList[v].lookup != -1 && VList[v].lookup <= vTot)
            return VList[v].lookup = Find(VList[v].lookup);
        else
            return VList[v].lookup = Find(VList[v].merge);
    }
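Building on Find and the records sketched earlier, the contraction criterion might be scored like this (edge_score is our own illustrative helper, with VList as a global array):

    /* Sketch: score an edge (v,w) for contraction; the edge with the
       maximal score is collapsed next. */
    double edge_score(int v, const Edge *e)
    {
        int w = Find(e->w);        /* resolve a possibly merged endpoint      */
        if (w == v) return -1.0;   /* edge already internal to a meta-vertex  */
        return e->CWgt / (VList[v].RWgt + VList[w].RWgt);
    }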

Page 13: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

MinEX Partition Phase

• The contracted graph allows efficient partitioning
• A heap with pointers is created
  – For each vertex, compute the optimal reassignment (one that satisfies the MinVar, Gain, and ThroTTle criteria)
  – Vertices are added to the Gain min-heap
  – The VList *heap pointer is set
• The heap is adjusted as vertices are reassigned
• The process stops when the heap becomes empty (a sketch of this loop follows)
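A self-contained sketch of the control flow; for brevity an array scan stands in for the Gain min-heap, and all numbers are invented:

    #include <stdio.h>

    #define NV 5
    /* Invented candidate moves: gain[v] is the change in QWgtToT,
       dvar[v] the improvement in MinVar, dest[v] the target processor. */
    static double gain[NV] = { -4.0, 2.0, 9.0, -1.0, 5.0 };
    static double dvar[NV] = { 50.0, 10.0, 80.0, 20.0, 5.0 };
    static int    dest[NV] = { 1, 2, 0, 3, 1 };
    static int    done[NV];

    static int pop_min(void)  /* smallest unprocessed Gain, or -1 */
    {
        int best = -1;
        for (int v = 0; v < NV; v++)
            if (!done[v] && (best < 0 || gain[v] < gain[best]))
                best = v;
        return best;
    }

    int main(void)
    {
        double throttle = 3.0;             /* example ThroTTle value      */
        int v;
        while ((v = pop_min()) >= 0) {     /* stops when the "heap" empties */
            done[v] = 1;
            if (dvar[v] > 0.0 && gain[v] / throttle <= dvar[v])
                printf("reassign vertex %d to processor %d\n", v, dest[v]);
        }
        return 0;
    }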

Page 14: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

MinEX Refinement Phase

• Refinement proceeds in reverse order from contraction by popping vertex pairs off the stack
• Reassignment of each refined vertex is considered and the partitioning process restarted (sketched below)
• Vertex lookup and merge values are reset by following the merge chain when edges are accessed (if lookup > vTot)
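A hedged sketch of that loop; uncollapse and consider_reassignment are hypothetical helpers standing in for MinEX internals:

    typedef struct { int v, w; } Pair;   /* a contracted vertex pair */

    void uncollapse(int v, int w);       /* hypothetical: split a meta-vertex   */
    void consider_reassignment(int v);   /* hypothetical: partition-phase step  */

    /* Pop vertex pairs off the stack in reverse contraction order,
       restore them, and reconsider each refined vertex's assignment. */
    void refine(Pair *stack, int top)
    {
        while (top > 0) {
            Pair p = stack[--top];
            uncollapse(p.v, p.w);
            consider_reassignment(p.v);
            consider_reassignment(p.w);
        }
    }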

Page 15: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Analysis of ThroTTle Values (P=32)

[Figure: expected MaxQWgt and expected LoadImb for varying ThroTTle values.]

Page 16: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Latency Tolerance Approach

1. Send data sets to be moved
2. Send edge data
3. Process vertices not waiting for edge communication
4. Receive and unpack remapped data sets
5. Receive and unpack communication data
6. Repeat steps 2-5 until all vertices are processed

• Move data sets and edge data first
• Achieve latency tolerance by overlapping processing with communication
• Optimistic view: processing completely hides the latency
• Pessimistic view: no latency hiding occurs
• The application passes the latency-hiding function to MinEX (a sketch of the overlap follows)
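A hedged MPI sketch of the overlap in steps 2-5; buffer names, tags, and the process_next_ready_vertex helper are all invented for illustration, and the slides do not prescribe MPI:

    #include <mpi.h>

    void process_next_ready_vertex(void);   /* hypothetical: step 3 work */

    /* Overlap processing with communication: post nonblocking receives
       for remapped data sets and edge data, then compute on vertices
       that are not waiting while the transfers complete. */
    void exchange_and_compute(double *remap_buf, int n_remap,
                              double *edge_buf, int n_edge,
                              int peer, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Irecv(remap_buf, n_remap, MPI_DOUBLE, peer, 0, comm, &req[0]);
        MPI_Irecv(edge_buf,  n_edge,  MPI_DOUBLE, peer, 1, comm, &req[1]);
        /* (matching MPI_Isend calls for steps 1-2 omitted) */

        int flag = 0;
        while (!flag) {
            process_next_ready_vertex();                 /* step 3 */
            MPI_Testall(2, req, &flag, MPI_STATUSES_IGNORE);
        }
        /* steps 4-5: unpack remap_buf and edge_buf here, then repeat */
    }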

Page 17: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Experimental Study: Simulation of an IPG Environment

• A configuration file defines clusters, processors, and interconnect slowdowns
• Processors in a cluster are assumed homogeneous
• Connect(c1, c2) = interconnect slowdown between clusters c1 and c2 (unity for no slowdown)
• If c1 = c2, Connect(c1, c2) is the intraconnect slowdown
• Proc_c represents the processing slowdown (normalized to unity) within a cluster
• The configuration file is mapped to the processing graph by MinEX so that actual vertex assignments in the distributed environment can be modeled
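As an illustration, the simulated parameters might be tabulated like this; all slowdown numbers are invented, and only the structure follows the slide:

    #define NCLUSTERS 3   /* example cluster count */

    /* Connect(c1,c2): interconnect slowdown between clusters; the
       diagonal holds each cluster's intraconnect slowdown (unity
       means no slowdown). */
    static const double ConnectTab[NCLUSTERS][NCLUSTERS] = {
        {  1.0, 10.0, 40.0 },
        { 10.0,  1.0, 40.0 },
        { 40.0, 40.0,  2.0 },
    };

    /* Proc_c: processing slowdown within cluster c, normalized to unity. */
    static const double Proc[NCLUSTERS] = { 1.0, 1.0, 1.5 };

    double Connect(int c1, int c2) { return ConnectTab[c1][c2]; }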

Page 18: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Test Application: Unstructured Adaptive Mesh

• Time-dependent shock wave propagated through a cylindrical volume
• Tetrahedral mesh discretization
• Previously refined elements are coarsened
• The mesh grows from 50K to 1.8M tets over nine adaptation levels
• The workload becomes unbalanced as the mesh is adapted

Page 19: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Characteristics of the Test Application

• Mesh elements interact only with immediate neighbors
• High communication and remapping costs
• The numerical solver is not included

Page 20: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

MinEX Partitioner Performance

• SBN: a dynamic load balancer based on Symmetric Broadcast Networks, adapted for mesh applications
• PLUM: a semi-dynamic framework for processing adaptive, unstructured meshes
• MinEX comparisons with SBN and PLUM (P=32):

            Edge Cut    Redistributed Elements
    SBN      36.5%            19,446
    PLUM     10.9%            63,270
    MinEX    20.9%            30,548

Page 21: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Experimental Results (P=32)

[Figure: expected runtimes, in thousands of units, across interconnect slowdowns, shown both with no latency tolerance and with maximum latency tolerance.]

Page 22: Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

Conclusions & Ongoing Research

• Introduced a new partitioner, MinEX, and experimented with it in simulated IPG environments
• Runtimes increase with larger slowdowns as clusters are added
• Additional clusters increase the benefits of latency tolerance
• Estimated runtimes with MinEX improved by a factor of five over no partitioning
• Currently applying MinEX to the N-body problem (Barnes-Hut algorithm)