
Parallel and Distributed Algorithms

Overview

Parallel Algorithm vs Distributed Algorithm
PRAM
Maximal Independent Set
Sorting using PRAM
Choice coordination problem
Real world applications

INTRODUCTION

The need for distributed processing:
A massively parallel processing machine
CPUs with 1000 processors
Moore's law coming to an end

Parallel Algorithm

A parallel algorithm is an algorithm that can be executed a piece at a time on many different processing devices, and then combined together at the end to get the correct result.*

* Blelloch, Guy E.; Maggs, Bruce M. Parallel Algorithms. USA: School of Computer Science, Carnegie Mellon University.

Distributed Algorithm

A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors.*

*Lynch, Nancy (1996). Distributed Algorithms. San Francisco, CA: Morgan Kaufmann Publishers. ISBN 978-1-55860-348-6.

PRAM

Random Access Machine (RAM)

An abstract machine with an unbounded number of local memory cells and a simple instruction set.

Time complexity: the number of instructions executed.
Space complexity: the number of memory cells used.
All operations take unit time.

PRAM (Parallel Random Access Machine)

PRAM is a parallel version of the RAM, for designing algorithms applicable to parallel computers.

Why PRAM?
The number of operations executed per cycle on P processors is at most P.
Any processor can read/write any shared memory cell in unit time.
It abstracts away overheads, which makes the complexity of PRAM algorithms easier to analyze.
It is a benchmark.

Example: n processors P1, P2, ..., Pn share a memory array A. Each processor Pi performs: Read A[i-1], Computation A[i] = A[i-1] + 1, Write A[i].
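To make the lock-step model concrete, here is a minimal Python sketch (our own construction; the slides give no code) of synchronous PRAM steps: within a step every read sees the memory as it was before the step, and all writes land afterwards (exclusive write: written addresses are distinct). Note that this particular computation is a dependency chain, so it still needs n steps even with n processors.

def pram_step(shared, program, n_procs):
    """One synchronous PRAM step: program(pid, snapshot) -> [(addr, value)]."""
    snapshot = list(shared)          # every read sees the pre-step memory
    writes = []
    for pid in range(n_procs):       # executed "in parallel" on a real PRAM
        writes.extend(program(pid, snapshot))
    for addr, val in writes:         # exclusive write: distinct addresses
        shared[addr] = val

# Each processor P_i executes: Read A[i-1], compute A[i] = A[i-1] + 1, Write A[i].
def increment_chain(pid, snap):
    i = pid + 1
    return [(i, snap[i - 1] + 1)]

A = [0] * 6                          # shared memory A[0..5], n = 5 processors
for _ in range(5):                   # after k steps, A[1..k] hold final values
    pram_step(A, increment_chain, 5)
print(A)                             # -> [0, 1, 2, 3, 4, 5]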

Shared Memory Access Conflicts

Exclusive Read (ER): all processors can simultaneously read from distinct memory locations.
Exclusive Write (EW): all processors can simultaneously write to distinct memory locations.
Concurrent Read (CR): all processors can simultaneously read from any memory location.
Concurrent Write (CW): all processors can write to any memory location.

The combinations: EREW, CREW, CRCW.

Complexity

Parallel time complexity: the number of synchronous steps in the algorithm.
Space complexity: the number of shared memory cells used.
Parallelism: the number of processors used.

MAXIMAL INDEPENDENT SET

Lahiru Samarakoon, Sumanaruban Rajadurai

Independent Set (IS): any set of nodes in which no two are adjacent.

Maximal Independent Set (MIS): an independent set that is not a subset of any other independent set.

Maximal vs. Maximum IS: a maximum independent set is an independent set of largest possible size; a maximal independent set merely cannot be extended by adding another node. [Figures: a maximum independent set vs. a maximal independent set.]

A Sequential Greedy Algorithm

Suppose that I will hold the final MIS; initially I = ∅ and G1 = G.

Phase 1: pick a node v1 of G1 and add it to I. Remove v1 and its neighbors N(v1); call the remaining graph G2.

Phase 2: pick a node v2 of G2 and add it to I. Remove v2 and its neighbors N(v2); call the remaining graph G3.

Phases 3, 4, 5, ..., x: repeat until all nodes are removed and no nodes remain.

At the end, the set I will be an MIS of G.

Running time of the algorithm: O(n). Worst-case graph: n nodes, where each phase may remove only a constant number of nodes, so the number of phases is linear in n.
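A Python sketch of these phases (our own dict-of-sets graph representation; the slides give no code):

def greedy_mis(adj):
    """adj: dict mapping each node to its set of neighbors."""
    remaining = set(adj)
    I = set()                        # will hold the final MIS
    while remaining:                 # phase k
        v = next(iter(remaining))    # pick any node v_k
        I.add(v)                     # add it to I
        remaining -= {v} | adj[v]    # remove v_k and its neighbors N(v_k)
    return I

# Example: the path 1-2-3-4-5.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(greedy_mis(adj))               # e.g. {1, 3, 5}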

Intuition for parallelization

At each phase we may select any independent set S (instead of a single node), add S to I, and remove S and the neighbors of S from the graph.

Example: suppose that I will hold the final MIS; initially I = ∅ and G1 = G.

Phase 1: find any independent set S in G1 and insert it into I: I ← I ∪ S. Remove S and its neighbors N(S); call the remaining graph G2.

Phase 2: find any independent set S in the new graph G2, insert it into I, and remove S and N(S), leaving G3.

Phase 3: the same on G3; after removing S and N(S), no nodes are left.

At the end, I is the final MIS of G.

Observation: the number of phases depends on the choice of the independent set in each phase. The larger the independent set at each phase, the faster the algorithm.

Randomized Maximal Independent Set (MIS)

Let d(v) denote the degree of node v.

At each phase k, each node v of Gk elects itself with probability p(v) = 1 / (2 d(v)), where d(v) is the degree of v in Gk. The elected nodes are candidates for the independent set Ik.

If two neighbors v and z are elected simultaneously, then the higher-degree node wins: if d(z) > d(v), node z stays elected and v is dropped.

If both have the same degree (d(z) = d(v)), ties are broken arbitrarily.

Problematic nodes in Gk (adjacent elected nodes) are removed using the previous rules; the remaining elected nodes form the independent set S.

Luby's algorithm, one round:
1. Mark lower-degree vertices with higher probability (each vertex v with probability 1/(2 d(v))).
2. If both endpoints of an edge are marked, unmark the one with the lower degree.
3. Add all marked vertices to the MIS.
4. Remove the marked vertices together with their neighbors and the corresponding edges.

A sketch of the algorithm in code follows.
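A Python sketch of one way to implement these rounds (a sequential simulation of the parallel steps; the names and the tie-breaking by node id are our choices):

import random

def luby_mis(adj):
    """adj: dict mapping each node to its set of neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}   # working copy of G_k
    mis = set()
    while adj:
        # Step 1: each vertex marks itself with probability 1/(2 d(v));
        # isolated vertices are marked outright.
        marked = {v for v, ns in adj.items()
                  if not ns or random.random() < 1.0 / (2 * len(ns))}
        # Step 2: if both endpoints of an edge are marked, unmark the
        # lower-degree endpoint (ties broken arbitrarily, here by node id).
        survivors = set(marked)
        for v in marked:
            for w in adj[v]:
                if w in marked and (len(adj[v]), v) < (len(adj[w]), w):
                    survivors.discard(v)
        # Step 3: add the surviving marked vertices to the MIS.
        mis |= survivors
        # Step 4: remove them, their neighbors, and all incident edges.
        dead = survivors | {w for v in survivors for w in adj[v]}
        adj = {v: ns - dead for v, ns in adj.items() if v not in dead}
    return mis

adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(luby_mis(adj))                 # e.g. {1, 3, 5} or {2, 5}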

ANALYSIS

Goodness property:
• A vertex v is good if at least ⅓ of its neighbors have lower degree than it, and bad otherwise.
• An edge is bad if both of its endpoints are bad, and good otherwise.

Lemma 1: Let v ∈ V be a good vertex with degree d(v) > 0. Then the probability that some vertex w in N(v) gets marked is at least 1 - exp(-1/6).

Define L(v) as the set of neighbors of v whose degree is less than v's degree. By definition, |L(v)| ≥ d(v)/3 if v is a GOOD vertex.
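The bound can be reconstructed from these definitions (marking decisions are independent, and every w ∈ L(v) has d(w) ≤ d(v)); a sketch of the calculation in LaTeX:

\Pr[\text{no } w \in N(v) \text{ is marked}]
  \le \prod_{w \in L(v)} \left(1 - \frac{1}{2\,d(w)}\right)
  \le \left(1 - \frac{1}{2\,d(v)}\right)^{d(v)/3}
  \le e^{-1/6}.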

Lemma 2: During any iteration, if a vertex w is marked, then it is selected to be in S with probability at least 1/2.

• From Lemmas 1 and 2, the probability that a good vertex belongs to S ∪ N(S) is at least (1 - exp(-1/6))/2.
• Good vertices are therefore eliminated with constant probability.
• Since at least half of the edges are good (Lemma 3 below), it follows that the expected number of edges eliminated during an iteration is a constant fraction of the current set of edges.
• This implies that the expected number of iterations of the parallel MIS algorithm is O(log n).

Lemma 3: In a graph G(V, E), the number of good edges is at least |E|/2.

Proof sketch: direct the edges in E from the lower-degree endpoint to the higher-degree endpoint, breaking ties arbitrarily. For all S, T ⊆ V, define the subset of the (oriented) edges E(S, T) as those edges that are directed from vertices in S to vertices in T, and let VG and VB be the sets of good and bad vertices. For each bad vertex v, fewer than a third of its edges are directed into it (few of its neighbors have lower degree); a counting argument over E(VB, VB) then shows that the bad edges make up at most half of E.

SORTING ON PRAM

Jessica Makucka, Puneet Dewan

Sorting

Current problem: sort n numbers. The best sequential comparison-based sorting time is O(n log n). Can we do better with more processors? YES!

Notes about Quicksort:
Sort n numbers on a PRAM with n processors.
Assume all numbers are distinct.
A CREW PRAM is used for this case.
Each of the n processors contains an input element.

Notation: let Pi denote the ith processor.

Quicksort Algorithm
0. If n = 1, stop.
1. Pick a splitter at random from the n elements.
2. Each processor determines whether its element is bigger or smaller than the splitter.
3. Let j denote the splitter's rank:
• If j ∉ [n/4, 3n/4], this is a failure: go back to (1).
• If j ∈ [n/4, 3n/4], this is a success: move the splitter to Pj. Every element smaller than the splitter is moved to a distinct processor Pi for i < j, and the larger elements are moved to distinct processors Pk for k > j.
4. Sort the elements recursively in processors P1 through Pj-1, and the elements in processors Pj+1 through Pn.

A sequential sketch of this control flow follows.
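A Python sketch (our own code; the comparison and routing steps that a CREW PRAM performs in parallel appear here as list comprehensions):

import random

def pram_style_quicksort(a):
    """Randomized quicksort with the [n/4, 3n/4] splitter-rank test."""
    if len(a) <= 1:
        return a                                   # step 0
    n = len(a)
    while True:
        s = random.choice(a)                       # step 1: random splitter
        smaller = [x for x in a if x < s]          # step 2: parallel compare
        larger = [x for x in a if x > s]
        j = len(smaller)                           # step 3: splitter's rank
        if n // 4 <= j <= 3 * n // 4:              # success: central rank
            break                                  # failure: retry step 1
    return (pram_style_quicksort(smaller) + [s] +
            pram_style_quicksort(larger))          # step 4: recurse

print(pram_style_quicksort([12, 3, 7, 5, 11, 2, 1, 14]))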

Quicksort Time Analysis

Step 1: pick a successful splitter at random from the n elements (by assumption). There are O(log n) stages along every sequence of recursive splits.
Step 2: each processor determines whether its element is bigger or smaller than the splitter. Trivial: this can be implemented in a single CREW PRAM step.

Step 3: let j denote the splitter's rank. If j ∉ [n/4, 3n/4], go back to step 1; if j ∈ [n/4, 3n/4], move the splitter to Pj, route every smaller element to a distinct processor Pi with i < j and every larger element to a distinct processor Pk with k > j. O(log n) PRAM steps are needed for a single splitting stage.

Comparison Splitting Stage (step 3)

Each processor Pi is assigned a bit according to whether its element is smaller or bigger than the splitter: 0 if the element is bigger, 1 otherwise. (The slides illustrate this with the elements 12, 3, 7, 5, 11, 2, 1, 14 held by P1..P8.)

The 1-bits are then summed pairwise in a balanced tree (step 1 adds adjacent pairs, step 2 combines those sums, and so on), so each smaller element learns its destination index in O(log n) addition steps, as sketched below.
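A Python sketch of that O(log n) prefix-sum computation (a Hillis-Steele style scan simulated sequentially; the bit vector is our own example):

def parallel_prefix_sum(bits):
    """Inclusive prefix sums in O(log n) synchronous rounds."""
    n = len(bits)
    s = list(bits)
    step = 1
    while step < n:                  # O(log n) rounds
        nxt = s[:]                   # reads see the previous round's values
        for i in range(step, n):     # each i handled by its own processor
            nxt[i] = s[i] + s[i - step]
        s = nxt
        step *= 2
    return s                         # s[i] = bits[0] + ... + bits[i]

bits = [0, 1, 1, 1, 0, 1, 1, 0]      # 1 = element smaller than the splitter
print(parallel_prefix_sum(bits))     # -> [0, 1, 2, 3, 3, 4, 5, 5]

Each smaller element's destination index is its prefix sum, which is why the routing in step 3 costs O(log n) PRAM steps.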

Overall Time Analysis

This algorithm terminates in O(log² n) steps: there are O(log n) recursion levels, and each splitting stage costs O(log n). This follows from solving the recurrence T(n) = T(3n/4) + O(log n) (a successful split leaves subproblems of size at most 3n/4).

Cons: the algorithm assumes that the split will always be successful, breaking the problem from n down to a constant fraction of n, and there is no suitable method to guarantee a successful split.

Improvement

Idea: reduce the problem to subproblems of size n^(1-e), where e < 1, while keeping the time to split the same.

Benefit: if e = 1/2, the total time over the entire problem will be log n + log n^(1/2) + log n^(1/4) + ..., resulting in O(log n). Then we can hope for an overall running time of O(log n), as the sum below shows.
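The claimed total is just a geometric series; in LaTeX:

\log n + \log n^{1/2} + \log n^{1/4} + \cdots
  = \log n \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right)
  = 2 \log n = O(\log n).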

The Long Story

Suppose that we have n processors and n elements, that processors P1 through Pr contain r of the elements in sorted order, and that processors Pr+1 through Pn contain the remaining n - r elements.

1. Choose random splitters and sort them. Call the sorted elements in the first r processors the splitters. For 1 ≤ j ≤ r, let sj denote the jth largest splitter.
2. Insert: insert the n - r unsorted elements among the splitters.
3. Sort the remaining elements among the splitters:
a. Each processor should end up with a distinct input element.
b. Let i(sj) denote the index of the processor containing sj following the insertion operation. Then, for all k < i(sj), processor Pk contains an element that is smaller than sj; similarly, for all k > i(sj), processor Pk contains an element that is larger than sj.

Example

Choose random splitters from the input 5, 9, 8, 10, 7, 6, 12, 11: here 6 and 11 are chosen.

Sort the random splitters: sorted list 6, 11; unsorted list 5, 9, 8, 7, 10, 12.

Insert the unsorted elements among the splitters: 5 | 6 | 7, 9, 8, 10 | 11 | 12.

Check whether the number of elements between consecutive splitters is at most log n. Let S denote a subproblem's size: here (with log n ≈ 3) the group {7, 9, 8, 10} has S = 4, which exceeds log n, while {5} and {12} have S = 1.

Recur on the subproblem whose size exceeds log n: again choose random splitters (e.g., 9 and 8) and follow the same process.

Partitioning as a tree: the first partition forms a tree with splitter 6 at the root, the box {5} below it, and splitter 11 with the oversized box {7, 9, 8, 10} and the box {12} on the right. Since the size on the right exceeds log n, that box is split again with the random splitters 9 and 8, after which every leaf box is sorted because of the partitioning and has size at most log n.

Lemmas to be used:

1. Take a CREW PRAM with n² processors, where each of the processors P1 through Pn has an input element to be sorted. The PRAM can sort these n elements in O(log n) steps. (Fact 1)
2. With n processors and n elements, of which n^(1/2) are splitters, the insertion process can be completed in O(log n) steps. (Fact 2)

BoxSort

Algorithm:
Input: a set S of numbers.
Output: the elements of S sorted in increasing order.
1. Select n^(1/2) elements (e is 1/2) at random from the n input elements. Using all n processors, sort them in O(log n) steps. (Fact 1)
2. Using the sorted elements from stage 1 as splitters, insert the remaining elements among them in O(log n) steps. (Fact 2)
3. Treating the elements that are inserted between adjacent splitters as subproblems, recur on each subproblem whose size exceeds log n. For subproblems of size log n or less, invoke LogSort.

Sort Fact: a CREW PRAM with m processors can sort m elements in O(m) steps.

Example: each processor is assigned an element and compares it with the remaining elements simultaneously, in O(m) steps. The rank assigned to each element implies the elements are sorted. With elements 5, 9, 8, 7, 10, 3, 4, 2 held by P1..P8, the assigned ranks are 4, 7, 6, 5, 8, 2, 3, 1. A sketch follows.
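A Python sketch of this rank computation (a sequential simulation of what the m processors do simultaneously; it assumes distinct elements, as above):

def rank_sort(a):
    """Each 'processor' counts how many elements are smaller than its own."""
    out = [None] * len(a)
    for i, x in enumerate(a):            # processor P_i, in parallel on a PRAM
        rank = sum(y < x for y in a)     # O(m) comparison steps
        out[rank] = x                    # ranks are distinct, so no conflicts
    return out

print(rank_sort([5, 9, 8, 7, 10, 3, 4, 2]))   # -> [2, 3, 4, 5, 7, 8, 9, 10]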

Things to remember: the last statement of the BoxSort algorithm relies on the idea on the previous slide.

LogSort: with log n processors and log n elements, we can sort in O(log n) steps (the sort fact with m = log n).

Analysis

Consider each node of the recursion tree as a box. Choosing random splitters and sorting them takes O(log n) time, and inserting the unsorted elements among the splitters takes O(log n). With high probability (assumed here), the subproblems resulting from the splitting operation, i.e. the unsorted elements between adjacent splitters, are very small, so each leaf is a box of size at most log n. The time spent at the leaves is O(log n) using LogSort, so the total time is O(log n).

DISTRIBUTED RANDOMIZED ALGORITHM

Yogesh S Rawat, R. Ramanathan

CHOICE COORDINATION PROBLEM (CCP)

Biological Inspiration

Mites of the genus Myrmoyssus reside as parasites on the ear membrane of moths of the family Phaenidae. Moths are prey to bats, and the only defense they have is that they can hear the sonar used by an approaching bat. If both ears of the moth are infected by the mites, its ability to detect the sonar is considerably diminished, thereby severely decreasing the survival chances of both the moth and its colony of mites.

The mites are therefore faced with a "choice coordination problem": how does any collection of mites infecting a particular ear ensure that every other mite chooses the same ear?

Problem Specification

A set of n processors and m options to choose from; the processors have to reach a consensus on a unique choice.

Model for Communication

• A collection of m read-write registers accessible to all the processors, with a locking mechanism for conflicts.
• Each processor follows a protocol for making a choice; a special symbol (√) is used to mark the choice.
• At the end, only one register contains the special symbol.

Deterministic Solution

• Complexity is measured in terms of the number of read and write operations.
• Any deterministic solution has complexity Ω(n^(1/3)) operations, where n is the number of processors.

For more details: M. O. Rabin, "The choice coordination problem," Acta Informatica, vol. 17, no. 2, pp. 121-134, Jun. 1982.

Randomized Solution

For any c > 0, the randomized protocol solves the problem using c operations with probability of success at least 1 - 2^(-Ω(c)).

For simplicity we consider only the case n = m = 2, although the protocol can easily be generalized.

Analogy from Real Life

Two people walking toward each other must each take a random action: give way or move ahead.

Person 1: Give way / Person 2: Give way — conflict remains
Person 1: Move ahead / Person 2: Move ahead — conflict remains
Person 1: Move ahead / Person 2: Give way — resolved
Person 1: Give way / Person 2: Move ahead — resolved

Only when the two actions differ is the symmetry broken.

Synchronous CCP

The two processors are synchronous and operate in lock-step according to a global clock.

Terminology used:
Pi - processor i, where i ∈ {0, 1}
Ci - shared register for choices, where i ∈ {0, 1}
Bi - local variable of each processor, where i ∈ {0, 1}

Processor Pi initially scans the register Ci. Thereafter, the processors exchange registers after every iteration. At no time will the two processors scan the same register.

Algorithm

Input: registers C0 and C1, initialized to 0.
Output: exactly one of the two registers has the value √.

Step 0 - Pi initially scans the register Ci.
Step 1 - Read the current register and obtain a bit Ri (a read operation).
Step 2 - Select one of three cases:
case 2.1 [Ri = √]: halt (the choice has already been made by the other processor).
case 2.2 [Ri = 0, Bi = 1]: write √ into the current register and halt (this is the only condition for making a choice).
case 2.3 [otherwise]: assign an unbiased random bit to Bi and write Bi into the current register (generate a random value; a write operation).
Step 3 - Pi exchanges its current register with P1-i and returns to step 1 (exchange registers).

A lock-step simulation sketch follows.
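A lock-step Python simulation of the protocol (our own sketch; MARK stands for the special symbol √, and the deferred write list models the synchronous write phase):

import random

MARK = "√"

def synchronous_ccp():
    C = [0, 0]                      # shared registers C0, C1
    B = [0, 0]                      # local bits B0, B1
    reg = [0, 1]                    # P_i starts by scanning C_i
    halted = [False, False]
    while not all(halted):
        writes = []
        for i in (0, 1):            # both processors read old values first
            if halted[i]:
                continue
            R = C[reg[i]]           # step 1: read the current register
            if R == MARK:           # case 2.1: the choice was already made
                halted[i] = True
            elif R == 0 and B[i] == 1:
                writes.append((reg[i], MARK))   # case 2.2: make the choice
                halted[i] = True
            else:                   # case 2.3: fresh unbiased random bit
                B[i] = random.randint(0, 1)
                writes.append((reg[i], B[i]))
        for r, v in writes:         # synchronous write phase
            C[r] = v
        reg = [1 - r for r in reg]  # step 3: exchange registers
    return C

print(synchronous_ccp())            # exactly one entry is "√"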

Correctness of the Algorithm

We need to prove that only one of the shared registers gets √ marked in it. Suppose both are marked with √. This must have happened in the same iteration; otherwise step 2.1 would have halted the later processor.

Assume the error takes place during the t-th iteration, and let Bi(t) and Ri(t) be processor Pi's values after step 1. By case 2.3 of the previous iteration, R0(t) = B1(t) and R1(t) = B0(t). Suppose Pi writes √ in the t-th iteration; then Ri = 0 and Bi = 1, hence R1-i = 1 and B1-i = 0, so P1-i cannot write √ in the t-th iteration. This is how the symmetry is broken.

[Worked example, tracked on the slides as (R0, B0 | C0, C1 | R1, B1) after every read and write: both processors start with B0 = B1 = 0 and C0 = C1 = 0. In each iteration both processors read, then (case 2.3) draw fresh random bits and write them, then exchange registers. In the run shown, the random bits agree for several iterations (first 0 and 0, then 1 and 1), so case 2.3 keeps repeating; eventually they differ (B0 = 0, B1 = 1). In the following iteration, the processor holding Bi = 1 reads Ri = 0, so case 2.2 fires: it writes √ into its current register (here C0) and HALTS; the other processor then reads √ and HALTS by case 2.1. Exactly one register ends up holding √.]

Complexity

• The probability that the two random bits B0 and B1 are the same is 1/2, so the probability that the number of steps exceeds t is 1/2^t; the algorithm terminates within the next two steps as soon as B0 and B1 differ.
• The computation cost of each iteration is bounded, so the protocol does O(t) work with probability 1 - 1/2^t.

The Problem

In the asynchronous case, the processors are not synchronized: P1 and P2 may visit the registers C1 and C2 in any interleaved order and at arbitrary speeds, so the lock-step argument above no longer applies.

What can we do? Idea: timestamps.

Every read now returns a pair <timestamp, value>. Let Ti denote the timestamp of processor Pi and ti the timestamp of register Ci; each processor Pi also keeps its local bit Bi.

Algorithm for a process Pi

Input: registers C1 and C2, initialized to <0, 0>.
Output: exactly one of the two registers has √.

0) Pi initially scans a randomly chosen register; <Ti, Bi> is initialized to <0, 0>.
1) Pi gets a lock on its current register and reads <ti, Ri>.
2) Pi executes one of these cases:
2.1) If Ri = √: HALT.
2.2) If Ti < ti: Ti ← ti and Bi ← Ri.
2.3) If Ti > ti: write √ into the current register and HALT.
2.4) If Ti = ti, Ri = 0, Bi = 1: write √ into the current register and HALT.
2.5) Otherwise: Ti ← Ti + 1, ti ← ti + 1, Bi ← a random (unbiased) bit; write <ti, Bi> into the current register.
3) Pi releases the lock on its current register, moves to the other register, and returns to step 1.

A threaded simulation sketch follows.
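A threaded Python sketch of this timestamped protocol (our own simulation; the Register class and MARK are our names, and the lock models the register-locking mechanism):

import random
import threading

MARK = "√"

class Register:
    """A shared <timestamp, value> register guarded by a lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.t = 0                       # register timestamp t_i
        self.val = 0                     # register value (a bit, or MARK)

def process(regs, start):
    T, B = 0, 0                          # processor timestamp T_i and bit B_i
    cur = start                          # randomly chosen starting register
    while True:
        reg = regs[cur]
        with reg.lock:                   # step 1: lock and read <t_i, R_i>
            t, R = reg.t, reg.val
            if R == MARK:                # case 2.1
                return
            if T < t:                    # case 2.2: adopt the newer timestamp
                T, B = t, R
            elif T > t or (R == 0 and B == 1):   # cases 2.3 / 2.4 (T == t)
                reg.val = MARK           # write √ and halt
                return
            else:                        # case 2.5: T == t here, so both
                T = reg.t = t + 1        # timestamps advance together
                B = random.randint(0, 1)
                reg.val = B
        cur = 1 - cur                    # step 3: move to the other register

regs = [Register(), Register()]
threads = [threading.Thread(target=process, args=(regs, random.randint(0, 1)))
           for _ in (0, 1)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print([r.val for r in regs])             # exactly one register holds "√"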

Worked example (initial state: both registers hold <0, 0>; register pairs are written <timestamp, value>):

1) P1 chooses C1 and reads <0, 0>. None of cases 2.1 to 2.4 is met; case 2.5 is satisfied: T1 and t1 become 1, B1 gets a random bit (here 1), and P1 writes <1, 1> into C1, releases the lock, and moves to C2. [History: P1==C1]
2) P2 chooses C2 and reads <0, 0>. Case 2.5 again: T2 = t2 = 1, B2 = 1; P2 writes <1, 1> into C2, releases the lock, and moves to C1. [History: P1==C1 P2==C2]
3) P2 locks C1 and reads <1, 1>. Case 2.5: T2 and t1 become 2, B2 gets a random bit (here 0); P2 writes <2, 0> into C1, releases the lock, and moves to C2. [History: P1==C1 P2==C2 P2==C1]
4) P2 locks C2 and reads <1, 1>. Now case 2.3 (T2 = 2 > t2 = 1) is satisfied: P2 writes √ into C2 and HALTS.

We'll show another case of the algorithm. Go back one iteration, to just after step 3, and let P1 move first:

4') P1 locks C2 and reads <1, 1>. Case 2.5: T1 and t2 become 2, B1 gets a random bit (here 1); P1 writes <2, 1> into C2 and moves to C1. [History: ... P1==C2]
5') P2 locks C2 and reads <2, 1>. Case 2.5: T2 and t2 become 3, B2 = 1; P2 writes <3, 1> into C2 and moves to C1. [History: ... P2==C2]
6') P1 locks C1 and reads <2, 0>. Case 2.4 (T1 = t1 = 2, R1 = 0, B1 = 1) is satisfied: P1 writes √ into C1 and HALTS. [History: ... P1==C1]
7') P2 locks C1 and reads √. Case 2.1 is satisfied: P2 HALTS. Exactly one register (here C1) holds √.

Correctness

When a processor writes √ into a register, the other processor must NOT write √ into the other register. The two ways of writing √ are:

Case 2.3) Ti > ti: write √ into the current register and halt.
Case 2.4) Ti = ti, Ri = 0, Bi = 1: write √ into the current register and halt.

Notation: Ti* is the current timestamp of processor Pi, and ti* is the current timestamp of register Ci. Two facts: whenever Pi finishes an iteration in Ci, Ti = ti; and when a processor enters a register, it has just left the other register.

Case 2.3 (Ti > ti: write √ into the current register and HALT). Consider P1, which has just entered C1 with t1* < T1*. In the previous iteration, P1 must have left C2 with the same T1*, so T1* ≤ t2*. P2 can go to C2 only after visiting C1, so T2* ≤ t1*. Summing up:

T2* ≤ t1* < T1* ≤ t2*

Since T2* < t2*, P2 cannot write √.

Case 2.4 (Ti = ti, Ri = 0, Bi = 1: write √ into the current register and HALT). Similarly, consider P1, which has entered C1 with t1* = T1*. As before, T1* ≤ t2* and T2* ≤ t1*. Summing up:

T2* ≤ t1* = T1* ≤ t2*

So T2* ≤ t2*, and at C2 processor P2 finds R2 = 1 while holding B2 = 0; neither case 2.3 nor case 2.4 can fire, so P2 cannot write √.

Complexity

• The cost is proportional to the largest timestamp.
• A timestamp can go up only in case 2.5.
• A processor's current Bi value is set during a visit to the other register.
• So the complexity analysis of the synchronous case applies.

REAL WORLD APPLICATIONS

Pham Nam Khanh

Applications of parallel sorting
• Sorting is a fundamental algorithm in data processing:
» parallel database operations: rank, join, etc.
» search (rapid index/lookup after sorting)
• Best record in sorting: 102.5 TB in 4,328 seconds, using 2100 nodes, from Yahoo.

Applications of MIS
• Wireless and communication
• Scheduling problems
• Perfect matching => assignment problem
• Finance

Applications of MIS in finance: the market graph. Its vertices are financial instruments (stocks, commodities, bonds) connected according to their pairwise correlations (illustrated on the slides with EAFE/EM market data); the low-latency requirement motivates the parallel MIS algorithm.

=> An MIS forms a completely diversified portfolio, in which all instruments are negatively correlated with each other => lower risk.

Applications of the choice coordination algorithm
• Given n processes, each one can choose between m options; they need to agree on a unique choice. CCP therefore belongs to the class of distributed consensus algorithms.
• Hardware and software tasks involving concurrency
• Clock synchronization in wireless sensor networks
• Multivehicle cooperative control

Multivehicle cooperative control
• Coordinate the movement of multiple vehicles in a certain way to accomplish an objective.
• Examples: task assignment, cooperative transport, cooperative role assignment, air traffic control, cooperative timing.

CONCLUSION

• PRAM model: CREW
• Parallel algorithm for Maximal Independent Set in O(log n), with applications
• Parallel sorting algorithms: QuickSort in O(log² n), BoxSort in O(log n)
• Choice Coordination Problem: distributed algorithms for synchronous and asynchronous systems, plus applications
