
Page 1:

Presentation Overview

1. Models of Parallel Computing: The evolution of the conceptual framework behind parallel systems.

2. Grid Computing: The creation of a structure within the parallel framework to facilitate efficient use of shared resources.

3. Cilk Language and Scheduler: A method for scheduling parallel tasks on a small low-latency network, and a programming language to provide parallel computing with time guarantees.

Page 2:

Presentation Overview

[Topic map: Parallel Computing → Grid Computing, with subtopics Scheduling (Cilk) and Resource Discovery (Set Matching, P2P Techniques, Routing Techniques)]

Page 3:

Models of Parallel Computing

All Models of Parallel Computing can be subdivided into these four broad categories:

• Synchronous Shared Memory Models

• Asynchronous Shared Memory Models

• Synchronous Independent Memory Models

• Asynchronous Independent Memory Models

Page 4:

What defines a synchronous model?

Generally speaking, in a synchronous system:

• All memory operations take exactly unit time.

• All processors that wish to perform an operation at time t do so simultaneously.

• Memory access conflicts are resolved using standard concurrency techniques.

Page 5:

Synchronous Shared Memory: PRAM

• Consists of P RAM processors each with a local register set

• Unbounded global shared memory

• Processors operate synchronously

Page 6:

PRAM Properties

• Processing units do not have their own memory.

• Processing units communicate only via global memory.

• Assumes synchronous memory access.

• Each processor has random access to any global memory cell in unit time.

Page 7:

Problems with PRAM

Problem 1: Assumes that the processors act synchronously without any overhead.

Problem 2: Assumes 100% processor and memory reliability.

Problem 3: Does not exploit caching or locality (all operations are “performed” in the main memory).

Problem 4: Model is unrealistic for real computers.

Page 8:

Asynchronous Shared Memory Models

• Most asynchronous shared memory systems build on the PRAM model, making it more feasible for actual implementation.

• We can easily make the PRAM model more realistic by assuming asynchronous operation and including an explicit synchronization step after every round.

• A round is the smallest unit of time that allows every processor to complete its computation in a given time step.

• These models can generally be implemented on a MIMD architecture, and charge appropriately for the cost of synchronization.
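The round-plus-synchronization-step idea above can be sketched with a barrier. This is a toy illustration (the worker body and the logging are invented for demonstration), not a PRAM implementation:

```python
import threading

# Toy sketch: asynchronous workers with an explicit synchronization
# step (a barrier) after every round, as described above.

def run_rounds(num_procs, num_rounds):
    barrier = threading.Barrier(num_procs)
    lock = threading.Lock()
    log = []                                   # (round, processor) pairs

    def worker(pid):
        for r in range(num_rounds):
            with lock:
                log.append((r, pid))           # this round's "computation"
            barrier.wait()                     # explicit synchronization step

    threads = [threading.Thread(target=worker, args=(p,))
               for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because every processor waits at the barrier, no entry for round r+1 can appear in the log before all entries for round r: rounds never interleave, which is exactly the property the explicit synchronization step buys.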

Page 9:

Synchronous Independent Memory Models

• These models consist of a connected set of processor/memory pairs

• Synchronization is assumed during computation

• The best example of a synchronous independent memory model is Bulk-Synchronous Parallel (BSP)

Page 10:

Bulk-Synchronous Parallel Model (BSP):

• Processing units are processor/memory pairs

• There is a router to provide inter-processor communication

• There is a barrier synchronizer to explicitly synchronize computation

Page 11:

BSP Properties

• BSP is conceptually simple, and provides a nice bridge to future models of computation that do not rely on shared memory.

• BSP is intuitive from a programming standpoint

• Can use any network topology with a router.

• Inter-processor message delivery time is not guaranteed, only a lower bound can be achieved (network latency).

• Synchronous operation is taken for granted in the program cost.

• Synchronization time is not guaranteed.

Page 12:

Asynchronous Independent Memory Models

• Most asynchronous independent memory models build on the BSP framework.

• These models tend to generalize BSP while providing upper bounds on communication cost and overhead.

• We briefly summarize the LogP model.

Page 13:

LogP

• Provides upper bound on network latency, and thus inter-processor communication time (overhead)

• All processors are seen as equidistant (network diameter is used for analysis)

• Resolves problems with router saturation in BSP

• Solves some of BSP's practical problems.

Page 14:

Summary of Computing Models

• Shared memory models are conceptually ideal from a programming point of view, but difficult to implement.

• Independent memory models are more feasible, but add complexity to synchronization.

• We will proceed to discuss Grid Computing with the general LogP model in mind.

Page 15:

Grid Computing

Exploring Resource Discovery Protocols

Page 16:

What is Grid Computing?

“A grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational resources. These resources include, but are not limited to, processors, data storage devices, software applications, and instruments such as telescopes or satellite dishes”.

[Foster, Kesselman 1998]

Page 17:

What is Grid Computing?

• Dependability: The system must provide predictable and sustained service.

• Consistency: A grid should provide uniform service despite the vast heterogeneity of connected systems.

• Pervasiveness: Services should be constantly available regardless of where you move throughout the system (or a similar service should be available).

• Inexpensiveness: The distributed structure should allow for affordable use of computational power relative to income and use.

Page 18:

What is Grid Computing?

“[Grid Computing] is the synergistic use of high-performance networking, computing, and advanced software to provide access to advanced computational capabilities, regardless of the location of users and resources.”

[Foster 1998]

Page 19:

What is Grid Computing?

Goal: To access and make efficient use of remote resources.

Page 20:

The Power Grid: A Motivating Analogy

• In 1910 efficient electric power generation was possible, but every user had to have his own generator.

• Connecting many heterogeneous electric generators together in a grid provided low-cost access to standardized service.

• Similarly, a computational grid could provide reliable low-cost access to computational power.

Computation today is like electricity in 1910

Page 21:

Why do we want a Grid?

• Solving difficult research problems

• Running large scale simulations

• Increase resource utilization

• Efficient use of scarce/distant resources

• Collaborative design and education

Page 22:

Major Classes of Grid Use

• Distributed Computing

• High Throughput

• On Demand

• Data Intensive

• Collaborative

Page 23:

Challenges of Grid Computing

• Building a Framework for communication

• Parallelizing code

• Dynamically scheduling resource use

• Providing consistent service despite heterogeneity

• Providing reliable service despite local failures

• Finding resources efficiently

Page 24:

Finding Resources in the Grid

Given an instance (or run) of a problem we want to solve, how can we expedite the following?

1. Determine what resources we will need to solve the problem

2. Locate sufficient resources in the Grid

3. Reserve these resources

4. Execute the problem run

Page 25:

Different Views of the Resource Discovery Problem

We can think of the Resource Discovery problem in three ways:

1. A peer-to-peer indexing problem

2. A routing problem

3. A web search/crawling problem

We will need to repose the Resource Discovery problem under each of these disciplines.

Page 26:

P2P for Resource Discovery

• We will need a separate index for each resource

• Several resources may be used in parallel

• We want a least-cost fit whenever possible, but an over-fit is likely acceptable

• We need accountability for resource use, and a way to credit users who share resources

• We will want caching, since users are likely to request the same types of resources multiple times

Page 27:

P2P for Resource Discovery

Peer-to-peer structure is desirable, but the search/lookup must be modified.

We will start to solve this problem by employing set matching techniques for peer-to-peer lookups.

Page 28:

Condor Classified Advertisements

• Condor Classified Advertisements (ClassAds) provide a mapping from attribute names to expressions

• Condor matchmaking takes two ClassAds and evaluates one w.r.t. the other

• Two ClassAds match iff each has an attribute “requirements” that evaluates as true in the context of the other ClassAd

• A ClassAd can have an attribute “rank” that gives a numerical value to the quality of a particular matching (large rank == better match).
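As a sketch of what such a match looks like, here is an invented machine/job pair. The attribute names, values, and scoping are schematic stand-ins, not verbatim Condor syntax:

```
# Machine ClassAd (illustrative)
Memory       = 2048
Arch         = "X86_64"
Requirements = other.ImageSize <= 1024
Rank         = 0

# Job ClassAd (illustrative)
ImageSize    = 512
Requirements = other.Memory >= 1024 && other.Arch == "X86_64"
Rank         = other.Memory
```

These two ClassAds match: each "requirements" attribute evaluates to true in the context of the other, and the job's "rank" expression prefers machines with more memory.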

Page 29:

Set Extended ClassAd Syntax

We can extend this structure and consider a match between a single set request and a ClassAd set:

• set expressions: place constraints on collective properties of the set (e.g., total disk space or total processing power)

• individual expressions: place constraints on each ClassAd in the set (e.g., each computer must have more than 1 GB of RAM)

In this context the Set Matching Algorithm will attempt to create a set of ClassAds that meets both the individual and set requirements.

Page 30:

Set Matching Algorithm

Note: The number of possible set matches is exponential in the number of ClassAds, so we will proceed with a heuristic approach.

Page 31:

Set Matching Algorithm: variables

ClassAdSet: the set of all ClassAds to be considered

BestSet: the closest set found so far

CandidateSet: the set considered at each iteration

LastRank: the rank of BestSet

Rank: the rank of CandidateSet

Page 32:

Set Matching Algorithm

While (ClassAdSet is not empty) {
    next = {X | X = argmax(rank(Y + CandidateSet)) for all Y in ClassAdSet};
    ClassAdSet -= next;
    CandidateSet += next;
    Rank = rank(CandidateSet);
    If (requirements(CandidateSet) == true and Rank > LastRank) {
        BestSet = CandidateSet;
        LastRank = Rank;
    }
}
return BestSet;
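A hedged Python sketch of the same heuristic; here `rank` and `requirements` are callables standing in for the ClassAd rank and requirements expressions:

```python
# Greedy set-matching sketch: repeatedly add the ClassAd that
# maximizes the rank of the grown candidate set, remembering the
# best candidate set that also satisfies the requirements.

def set_match(class_ads, rank, requirements):
    remaining = list(class_ads)
    candidate, best = [], None
    last_rank = float("-inf")
    while remaining:
        # argmax step: the ad that best improves the candidate set
        nxt = max(remaining, key=lambda ad: rank(candidate + [ad]))
        remaining.remove(nxt)
        candidate = candidate + [nxt]
        r = rank(candidate)
        if requirements(candidate) and r > last_rank:
            best, last_rank = list(candidate), r
    return best
```

For example, with ads carrying an invented `mem` attribute, a requirement of at least 5 units of total memory, and a rank that penalizes overshoot, the heuristic returns a set totalling exactly 5.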

Page 33:

Resource Discovery

We can use Set Matching for Resource Discovery:

• The user provides a mapper that maps the workload for a certain application or problem to resource requirements and a topology

• The resource set is compiled using MDS and a “resource monitor”

• Set matching is applied in conjunction with the mapper to find an appropriate set of resources

Page 34:

Resource Discovery: MDS

• MDS, the Monitoring and Discovery Service component of the Globus™ Toolkit, provides information about a server’s configuration, CPU load, etc.

• Any query tool can be used in its place

• Servers can be queried periodically to maintain a central database, or as needed within a P2P structure

Page 35:

Resource Discovery: Architecture

Page 36:

P2P Resource Discovery

Consider a P2P network with a fixed-degree topology, where each node has the ClassAd for all of its neighbors. We could attempt to locate resources using the following technique:

1. Run Set-Matching locally on the ClassAd NeighborSet

2. If requirements are not met, forward BestSet to a neighbor

3. Repeat the process without visiting a node more than once

4. Report BestSet (or CandidateSet) when the TTL expires
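As a toy illustration of the TTL-bounded forwarding loop (the graph, the per-node "resources", and the greedy matching are all invented stand-ins for real ClassAd set matching):

```python
# Walk the network, greedily collecting resource units until `need`
# units are gathered or the TTL expires, never revisiting a node.

def discover(graph, ads, start, need, ttl):
    visited, best = {start}, []
    node = start
    while ttl > 0:
        best.extend(ads[node])                 # local "set matching"
        if sum(best) >= need:
            return best                        # requirements met: stop
        frontier = [n for n in graph[node] if n not in visited]
        if not frontier:
            break                              # nowhere left to forward
        node = frontier[0]                     # forward BestSet to a neighbor
        visited.add(node)
        ttl -= 1
    return best                                # TTL expired: report best so far
```

On a three-node chain a→b→c with 2, 3, and 5 units respectively, a request for 4 units is satisfied after one forwarding hop, while a TTL of 1 reports only what the first node saw.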

Page 37:

What is Cilk?

• Cilk is a C-based runtime system for multithreaded distributed applications.

Including:

• A C language extension.

• A thread scheduler.

Page 38:

What are Cilk's Goals?

• Provide a guaranteed bound on running time.

• Define a set of problems that lend themselves to efficient distributed multithreading.

• Encourage programmers to code for multithreading.

Page 39:

Motivation:

• Multithreaded programs written in a traditional language like C/C++ typically run within an acceptable approximation of the optimal running time when used in practice.

• These same implementations often have poor worst-case performance.

• Cilk guarantees performance within a constant factor of optimal, but limits itself to a subset of fully strict problems.

Page 40:

What is this fully strict business?

1. A fully strict computation consists of tasks that pass data only to their direct parent task.
   • A task is a single time unit of work.
   • Threads are composed of one or more tasks in order.

2. In a fully strict computation, threads cannot block.
   • Instead, a thread spawns a special successor thread to receive return values.
   • Successor threads do not acquire the CPU until the return values are ready.

Page 41:

Additional Definitions:

• Task – a single time unit of work, executed by exactly one processor.

• Thread States – a thread can be alive (ready to execute) or stalled (waiting for data from another thread).

• Activation Frame – the memory shared by tasks in a single thread, which remains allocated regardless of the state of the thread.

• Activation Subtree – at any time t, the activation subtree consists of those threads that are alive.

• Activation Depth – the combined size of all child activation frames with respect to a parent thread.

Page 42:

The Cilk Model of Multithreaded Computation

Page 43:

Scheduling with Work Stealing

• Work Sharing – when a task is created, the host tries to migrate it to another processor. The drawback is that threads are migrated even when the overall workload is high.

• Work Stealing – underutilized processors attempt to migrate tasks from other processors. The advantage is that under high workload communication is minimized, because task migration only takes place when the recipient of the task has the necessary resources to service it.

Page 44:

Goals of Work Stealing

• Keep the processors busy.

• Bound runtimes.

• Limit the number of active threads in order to bound memory usage.

• Maximize locality of related tasks (keep them on the same processor).

• Minimize communication between remote tasks.

Page 45:

Work Stealing Definitions:

• T1 – the number of tasks in a computation; also the time it would take on a single processor.

• TP – the time used by a P-processor scheduling of a computation.

• T∞ – the depth of the computation's critical path.

• S1 – the activation depth of a computation on a single processor.

• SP – the activation depth on P processors.

Remember:

• Activation Depth – the combined size of all child activation frames (allocated memory) with respect to a parent task.

Page 46:

Greedy Scheduling

At each step, execute anything that is ready, in any order, utilizing as many processors as you have ready tasks (i.e., tasks not waiting on a dependency).

Analysis: achieves TP <= T1 / P + T∞

In other words, it will take less than or equal to the amount of time it would take to compute each task plus the time to compute the critical path, i.e. the longest chain of dependencies.
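The bound is easy to evaluate numerically; the task counts below are made up purely for illustration:

```python
# Greedy scheduling bound: T_P <= T_1 / P + T_inf.

def greedy_bound(t1, t_inf, p):
    return t1 / p + t_inf

# e.g. 10,000 unit tasks with a critical path of 100, on 64 processors:
print(greedy_bound(10_000, 100, 64))   # 256.25
```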

Problem: Memory usage is unbounded.

Page 47:

Memory Usage with Greedy Scheduling

Greedy Scheduling can duplicate memory across multiple processors.

For example, when a new task is spawned and different processors are handling the parent and the child, the parent's address space will also be copied to the processor handling the child.

We want an algorithm that guarantees that total memory usage will be within a constant factor of what the computation would consume on a single processor.

Page 48:

Busy-Leaves Scheduling with Thread Pools

A global pool is kept containing threads not bound to a processor.

All processors follow this algorithm:

1. If idle, get a new thread A from the pool.

2. If A spawns a thread B, return A to the pool and commence work on B.

3. If A stalls, return A to the pool.

4. If A dies, check whether all of its parent B's children are dead. If so, commence work on B.

This algorithm essentially guarantees that all leaves in the execution tree are busy.

Page 49:

Analysis of Busy-Leaves Scheduling

• TP <= T1/P + T∞

• SP <= S1 · P

In other words, the total memory allocated across all P processors is at most P times the amount the computation would use on a single processor.

Problem: Competition for access to the global thread pool can slow down the overall running time.

Page 50:

Randomized Work-Stealing Algorithm

Randomized Work-Stealing eliminates the global shared pool, and replaces it with a stack at each processor. New tasks are put on the top of the stack, and migrated tasks are taken off the bottom of the stack.

Algorithm:

1. If idle, remove a thread A from the bottom of the stack.

2. If A enables a stalled parent B, B is placed on the stack (B may have to be found and stolen from another stack).

3. If A spawns a child C, A is put on the stack and work on C commences.

4. If A dies or stalls, check the stack for another thread; if one exists, commence execution. If the stack is empty, steal the bottommost thread of a random processor.
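A toy sequential simulation of the deque discipline above (local work comes off the top, steals come off the bottom of a random victim). It illustrates the data movement only; it is not a faithful Cilk scheduler:

```python
import random
from collections import deque

def run(tasks, num_procs, seed=0):
    rng = random.Random(seed)
    stacks = [deque() for _ in range(num_procs)]
    for i, t in enumerate(tasks):              # scatter the initial tasks
        stacks[i % num_procs].append(t)
    done = []
    while any(stacks):
        for p in range(num_procs):
            if stacks[p]:
                done.append(stacks[p].pop())   # local work: pop the top
            else:
                victims = [v for v in range(num_procs)
                           if v != p and stacks[v]]
                if victims:
                    v = rng.choice(victims)
                    # steal from the bottom of a random victim's stack
                    stacks[p].append(stacks[v].popleft())
    return done
```

Every task is eventually executed exactly once, and steals only happen when a processor's own stack is empty.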

Page 51:

Analysis of Randomized Work Stealing (Outline)

At each step, give a dollar to every processor; each processor must put its dollar in a logical bucket. This is known as an accounting argument.

Each processor puts its dollar in:• The Work bucket if it executed a task at this step.• The Steal bucket if it initiates a steal at this step.• The Wait bucket if it waits for a queued steal request.

At the end of the computation:

• There are exactly T1 dollars in Work buckets.

• The expected sum of all Steal buckets is O(P T∞).

• The total bytes communicated is expected O(P T∞ Smax), where Smax is the size of the largest activation frame.

• The expected sum of all Wait buckets is at most the sum of the Steal buckets.
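Putting the buckets together gives the expected time bound (a sketch of the standard argument, with constants elided): with P processors running for TP steps, exactly P·TP dollars are handed out, so

```latex
P \, T_P \;=\; \underbrace{T_1}_{\text{work}}
\;+\; \underbrace{O(P\,T_\infty)}_{\text{steal}}
\;+\; \underbrace{O(P\,T_\infty)}_{\text{wait}}
\quad\Longrightarrow\quad
T_P \;=\; \frac{T_1}{P} + O(T_\infty) \text{ in expectation.}
```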

Page 52:

Implementation of Cilk

Cilk uses explicit continuation passing, meaning that any return values must be explicitly sent to the appropriate successor thread.

Data structures:

• Closure – holds a pointer to a function, a slot for each of its arguments, and a counter indicating how many arguments are still to be supplied. A closure is ready when all its arguments are present.

• Continuation – holds a pointer to an empty closure slot. Continuations can be shared among threads; for example, ?k can be passed to a spawned function to be filled in later.

Function calls:

spawn function (args)        // spawn a child thread
spawn_next function (args)   // spawn a successor thread
send_argument (k, value)     // send value to continuation k

Page 53:

Example: Fibonacci in Cilk

thread fib (cont int k, int n)     // k is where the return value will be sent
{
    if (n < 2)
        send_argument(k, n);       // base case: return n
    else {
        // main work done in this section
        cont int x, y;
        spawn_next sum (k, ?x, ?y);
        spawn fib (x, n - 1);
        spawn fib (y, n - 2);
    }
}

thread sum (cont int k, int x, int y)
{
    send_argument(k, x + y);       // forward the sum to continuation k
}

Page 54:

Programming for Parallelism

For systems with many processors and good scheduling, performance depends on the critical path (T∞).

Remember: TP = T1/P + T∞

The critical path dictates performance more and more as the number of processors increases; this is easy to see by taking the limit of TP as P goes to infinity.
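Spelling out the limit the slide refers to: as P grows, the T1/P term vanishes and only the critical path remains.

```latex
T_P \;=\; \frac{T_1}{P} + T_\infty
\qquad\Longrightarrow\qquad
\lim_{P \to \infty} T_P \;=\; T_\infty
```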

Page 55:

Practical Example: *Socrates

*Socrates is a chess program written in Cilk, once considered one of the strongest programs in the world. It was developed on a 32-processor cluster, but the final build was to run on a 512-processor machine.

Say, for example, that there were two competing algorithms:

A. T32 = 65 seconds (T1 = 2048, T∞ = 1)
B. T32 = 40 seconds (T1 = 1024, T∞ = 8)

But on 512 processors:

A. T512 = T1/P + T∞ = 2048/512 + 1 = 5
B. T512 = 1024/512 + 8 = 10

So the version that was slower on the development machine is the faster one on the target machine.

Page 56:

Example 2: Merge-Sort

If you were to naively translate merge-sort into a parallel algorithm:

Merge-Sort(A, p, r)                    // sort array A from index p to index r
    if p < r                           // we are not done
        q <- (p + r) / 2               // split the range into two equal pieces
        spawn Merge-Sort(A, p, q)      // sort first partition
        spawn Merge-Sort(A, q+1, r)    // sort second partition
        Merge(A, p, q, r)              // merge the two sorted partitions
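A sequential Python rendering of the same recursion; the two recursive calls are the ones the Cilk version would spawn in parallel:

```python
# Naive merge-sort with the same structure as the pseudocode above.

def merge_sort(a, p, r):
    if p < r:                        # more than one element left
        q = (p + r) // 2             # split into two halves
        merge_sort(a, p, q)          # spawned in the Cilk version
        merge_sort(a, q + 1, r)      # spawned in the Cilk version
        merge(a, p, q, r)            # merge the two sorted halves

def merge(a, p, q, r):
    left, right = a[p:q + 1], a[q + 1:r + 1]
    i = j = 0
    for k in range(p, r + 1):
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            a[k] = left[i]; i += 1
        else:
            a[k] = right[j]; j += 1
```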

Page 57:

Merge-Sort Analysis

Because Merge() takes O(n) time on an array of n elements, the sort takes O(n lg n) time on a single processor:

T1 = O(n lg n)

For the parallel version:

T∞(n) = T∞(n/2) + O(n)

Thus the critical path of the parallel version is O(n). This is not a great improvement: the speedup is T1/T∞, or just lg n in this case.

Page 58:

Better Parallel Merge-Sort

Algorithm:

• You need to merge arrays A and B (A is the larger).

• Take the median of A. O(1).

• Partition B against the median of A. O(lg n).

• Recursively merge Alow with Blow and Ahigh with Bhigh.
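A sequential Python sketch of that divide step; in Cilk the two recursive merges at the end would be spawned in parallel:

```python
import bisect

# Parallel-merge divide step: take the median of the larger array A,
# binary-search it in B, then recursively merge the low halves and
# the high halves.

def pmerge(A, B):
    if len(A) < len(B):
        A, B = B, A                      # keep A as the larger array
    if not A:
        return []
    if not B:
        return list(A)
    mid = len(A) // 2
    m = A[mid]                           # median of A, O(1)
    j = bisect.bisect_left(B, m)         # partition B against m, O(lg n)
    # the two recursive merges are independent (parallel in Cilk)
    return pmerge(A[:mid], B[:j]) + [m] + pmerge(A[mid + 1:], B[j:])
```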

Page 59:

Analysis of Parallel Merge-Sort

Critical path: PM∞(n) ≤ PM∞(3n/4) + O(lg n) = O(lg² n)

Because the two recursive merges are done in parallel, the combined worst case equals the worst case of a single merge. The 3n/4 arises because, in the worst case, we must merge half of A with all of B.

Page 60:

Example 3: Matrix Multiplication

We said that a low critical path was always good, but given a limit on P, compromises can and must be made.

For example, say we have a matrix multiplication algorithm that requires 10⁷ processors and reduces the running time from O(n³) to O(lg² n).

If we only have 10⁶ processors available, it would be advantageous to reduce the parallelism (and thus the number of required concurrent processors), even if this results in a higher runtime, say O(n).