Parallel Analysis of Algorithms: PRAM + CGM


TRANSCRIPT

Page 1: Parallel Analysis of Algorithms: PRAM + CGM

Parallel Analysis of Algorithms: PRAM + CGM

Page 2: Parallel Analysis of Algorithms: PRAM + CGM


Outline
• Parallel Performance
• Parallel Models
  – Shared Memory (PRAM, SMP)
  – Distributed Memory (BSP, CGM)

Page 3: Parallel Analysis of Algorithms: PRAM + CGM


Question?
Professor Speedy says he has a parallel algorithm for sorting n arbitrary items in n time using p > 1 processors. Do you believe him?


Page 4: Parallel Analysis of Algorithms: PRAM + CGM


Performance of a Parallel Algorithm

n : problem size (e.g., sort n numbers)
p : number of processors
T(p) : parallel time
Ts : sequential time (optimal sequential algorithm)
s(p) = Ts / T(p) : speedup (1 <= s(p) <= p)

[Figure: speedup s versus p; the line s(p) = p separates super-linear (above), linear, and sub-linear (below) speedup.]


Page 5: Parallel Analysis of Algorithms: PRAM + CGM


Speedup
• linear speedup s(p) = p : optimal
• super-linear speedup s(p) > p : impossible

Proof. Assume that parallel algorithm A has speedup s > p on p processors, i.e. s = Ts / T(p) > p. Hence Ts > T(p) · p. Simulate A on a sequential, single-processor machine: the simulation takes time T(1) <= T(p) · p < Ts. Hence Ts was not optimal. Contradiction.


Page 6: Parallel Analysis of Algorithms: PRAM + CGM


Amdahl’s Law
Let f, 0 < f < 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup is s(p) <= 1 / [f + (1-f)/p].

Proof: Let Ts be the sequential time. Then T(p) >= f·Ts + (1-f)·Ts / p. Hence

s(p) <= Ts / [f·Ts + (1-f)·Ts/p] = 1 / [f + (1-f)/p].
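To make the bound concrete, here is a small stand-alone C++ sketch (function name and sample values are chosen for this illustration, not taken from the slides) that tabulates 1 / [f + (1-f)/p]:

```cpp
#include <cstdio>

// Amdahl's bound on speedup: s(p) <= 1 / (f + (1 - f) / p).
double amdahl_bound(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main() {
    const double fractions[] = {0.0, 0.05, 0.5};   // sequential fraction f
    const double procs[]     = {10, 100, 1000};    // processor counts p
    for (double f : fractions)
        for (double p : procs)
            std::printf("f = %.2f, p = %6.0f  ->  s <= %.2f\n", f, p, amdahl_bound(f, p));
    // For f = 0.05 the speedup never exceeds 20, no matter how large p becomes.
    return 0;
}
```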


Page 7: Parallel Analysis of Algorithms: PRAM + CGM

Amdahl’s Law


[Figure: Amdahl’s law. (a) One processor: a serial section of length f·ts followed by parallelizable sections totalling (1-f)·ts. (b) p processors: the serial section f·ts remains, the parallelizable part shrinks to (1-f)·ts/p, giving total time tp.]

Page 8: Parallel Analysis of Algorithms: PRAM + CGM


Amdahl’s Law

[Figure: Amdahl’s law: execution time for p = 1, 5, 10, and 1000 processors.]


Page 9: Parallel Analysis of Algorithms: PRAM + CGM


Amdahl’s Law: s(p) <= 1 / [f + (1-f)/p]

f -> 0 : s(p) -> p
f -> 1 : s(p) -> 1
f = 0.5 : s(p) = 2p/(p+1) <= 2
f = 1/k : s(p) = k / [1 + (k-1)/p] <= k

Page 10: Parallel Analysis of Algorithms: PRAM + CGM


[Figure: speedup s plotted against k (recall f = 1/k gives s <= k).]

Page 11: Parallel Analysis of Algorithms: PRAM + CGM


Scaled or Relative Speedup

Ts may be unknown (in fact, for most real experiments this is the case)

Relative speedup: s’(p) = T(1) / T(p)

s’(p) >= s(p)   (since T(1) >= Ts)


Page 12: Parallel Analysis of Algorithms: PRAM + CGM


Efficiency
e(p) = s(p) / p : efficiency (0 <= e <= 1)
optimal (linear) speedup s(p) = p  <=>  e(p) = 1
e’(p) = s’(p) / p : relative efficiency


Page 13: Parallel Analysis of Algorithms: PRAM + CGM


Outline
• Parallel Analysis of Algorithms
• Models
  – Shared Memory (PRAM, SMP)
  – Distributed Memory (BSP, CGM)

Page 14: Parallel Analysis of Algorithms: PRAM + CGM


Parallel Random Access Machine (PRAM)

• Exclusive-Read (ER) or Concurrent-Read (CR)
• Exclusive-Write (EW) or Concurrent-Write (CW)

[Figure: PRAM: processors 1, 2, …, p connected to a shared memory with cells 1, 2, …, n.]


Page 15: Parallel Analysis of Algorithms: PRAM + CGM


Concurrent-Write (CW)
• Common: all processors must write the same value
• Arbitrary: an arbitrary value “wins”
• Smallest: the smallest value “wins”
• Priority: the processor with the smallest ID number “wins”


Page 16: Parallel Analysis of Algorithms: PRAM + CGM


Default: CREW (Concurrent Read, Exclusive Write)
p = O(n) : fine grained, massively parallel


Page 17: Parallel Analysis of Algorithms: PRAM + CGM


Performance of a PRAM Algorithm

Optimal: T = O( Ts / p )
Efficient: T = O( log^k(n) · Ts / p )
NC: T = O( log^k(n) ) for p = polynomial(n)


Page 18: Parallel Analysis of Algorithms: PRAM + CGM


Example: Multiply n numbers
Input: a1, a2, …, an
Output: a1 * a2 * a3 * … * an
( * : an associative operator)


Page 19: Parallel Analysis of Algorithms: PRAM + CGM


Algorithm 1

p = n/2 processors; in each step the remaining values are combined in adjacent pairs, halving their number (a balanced binary combining tree).
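The slide gives Algorithm 1 as a figure; the sketch below is a sequential simulation of the combining tree described above, with each loop pass standing for one PRAM round (it assumes the input length is a power of two and uses ordinary multiplication for *):

```cpp
#include <cstddef>
#include <vector>

// Algorithm 1, simulated round by round: in round r, "processor" i
// combines elements 2i and 2i+1 of the previous round's array,
// so the number of values halves each round: O(log n) rounds, p = n/2.
double reduce_tree(std::vector<double> a) {             // a.size() must be a power of two
    while (a.size() > 1) {
        std::vector<double> next(a.size() / 2);
        for (std::size_t i = 0; i < next.size(); ++i)   // conceptually, all i in parallel
            next[i] = a[2 * i] * a[2 * i + 1];
        a.swap(next);
    }
    return a[0];
}
```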


Page 20: Parallel Analysis of Algorithms: PRAM + CGM


Analysis
p = n/2, T = O( log n )
Ts = O(n), Ts / p = O(1)
=> the algorithm is efficient & NC, but not optimal


Page 21: Parallel Analysis of Algorithms: PRAM + CGM


Algorithm 2
• make available only p = n / log n processors
• execute Algorithm 1 using “rescheduling”: whenever Algorithm 1 has a parallel step in which m > n / log n processors are used, simulate this step by a “phase” consisting of m / (n / log n) steps on the n / log n processors (see the sketch below).
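The rescheduled computation is commonly realized as the equivalent block-then-tree scheme sketched here: each of the p = n / log n processors first multiplies its own block of about log n items, then the p partial products go through the tree of Algorithm 1. This is not a literal step-by-step simulation of Algorithm 1, but it achieves the same O(log n) time with the same number of processors:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Block-then-tree product with p "processors" (choose p ~ n / log n):
// phase 1: O(n/p) = O(log n) sequential work per processor on its own block;
// phase 2: O(log p) combining rounds over the p partial products.
double reduce_blocked(const std::vector<double>& a, std::size_t p) {
    const std::size_t n = a.size();
    const std::size_t block = (n + p - 1) / p;
    std::vector<double> partial(p, 1.0);
    for (std::size_t i = 0; i < p; ++i)                      // phase 1, all i in parallel
        for (std::size_t j = i * block; j < std::min(n, (i + 1) * block); ++j)
            partial[i] *= a[j];
    while (partial.size() > 1) {                             // phase 2: tree of Algorithm 1
        std::vector<double> next((partial.size() + 1) / 2, 1.0);
        for (std::size_t i = 0; i < next.size(); ++i) {
            next[i] = partial[2 * i];
            if (2 * i + 1 < partial.size()) next[i] *= partial[2 * i + 1];
        }
        partial.swap(next);
    }
    return partial[0];
}
```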


Page 22: Parallel Analysis of Algorithms: PRAM + CGM



Page 23: Parallel Analysis of Algorithms: PRAM + CGM


Analysis
# steps in phase i : (n / 2^i) / (n / log n) = log n / 2^i

T = O( Σ_{1<=i<=log n} log n / 2^i ) = O( log n · Σ_{i>=1} 1/2^i ) = O( log n )

p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )

=> the algorithm is efficient & NC & optimal


Page 24: Parallel Analysis of Algorithms: PRAM + CGM

Problem 2: List Ranking
Input: a linked list represented by an array.
Output: the distance of each node to the last node.

Page 25: Parallel Analysis of Algorithms: PRAM + CGM

Algorithm: Pointer Jumping

Assign processor i to node i.
Initialize (all processors i in parallel):
  D(i) := 0 if P(i) = i, 1 otherwise
REPEAT log n TIMES (all processors i in parallel):
  D(i) := D(i) + D(P(i))
  P(i) := P(P(i))
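A direct sequential simulation of the rounds above; the double buffering (D2, P2) stands in for the synchronous parallel reads, and array index i plays the role of processor i:

```cpp
#include <cstddef>
#include <vector>

// List ranking by pointer jumping.  P[i] is the successor of node i
// (P[i] == i marks the last node); the returned D[i] is the distance
// from node i to the last node.
std::vector<int> list_rank(std::vector<int> P) {
    const std::size_t n = P.size();
    std::vector<int> D(n);
    for (std::size_t i = 0; i < n; ++i)                   // initialization
        D[i] = (P[i] == static_cast<int>(i)) ? 0 : 1;
    std::size_t rounds = 0;                               // ceil(log2 n) rounds
    for (std::size_t m = 1; m < n; m <<= 1) ++rounds;
    for (std::size_t r = 0; r < rounds; ++r) {            // REPEAT log n TIMES
        std::vector<int> D2(D), P2(P);
        for (std::size_t i = 0; i < n; ++i) {             // all i "in parallel"
            D2[i] = D[i] + D[P[i]];                       // D(i) := D(i) + D(P(i))
            P2[i] = P[P[i]];                              // P(i) := P(P(i))
        }
        D.swap(D2);
        P.swap(P2);
    }
    return D;
}
```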

Page 26: Parallel Analysis of Algorithms: PRAM + CGM
Page 27: Parallel Analysis of Algorithms: PRAM + CGM

Analysis
p = n, T = O( log n )

efficient & NC but not optimal

Page 28: Parallel Analysis of Algorithms: PRAM + CGM

Problem 3: Partial Sums
Input: a1, a2, …, an
Output: a1, a1 + a2, a1 + a2 + a3, ..., a1 + a2 + a3 + … + an

Page 29: Parallel Analysis of Algorithms: PRAM + CGM

Parallel Recursion
• Compute (in parallel): a1 + a2, a3 + a4, a5 + a6, ..., an-1 + an
• Recursively (all processors together) solve the problem for the n/2 numbers a1 + a2, a3 + a4, a5 + a6, ..., an-1 + an
• The result is: (a1+a2), (a1+a2+a3+a4), (a1+a2+a3+a4+a5+a6), ..., (a1+a2+…+an-3+an-2), (a1+a2+…+an-1+an)
• Fill each remaining gap by combining its predecessor with a single input element (see the sketch below).
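A sequential rendering of this recursion for the + operator (each inner loop corresponds to one parallel step; assumes n is a power of two):

```cpp
#include <cstddef>
#include <vector>

// Recursive parallel-prefix scheme from the slide, simulated sequentially:
// 1. add neighbours pairwise, 2. recurse on the n/2 pairwise sums,
// 3. the recursive results are the prefixes ending at even positions;
//    each remaining "gap" is its predecessor plus one more input element.
std::vector<long> prefix_sums(const std::vector<long>& a) {   // a.size() is a power of two
    const std::size_t n = a.size();
    if (n == 1) return a;
    std::vector<long> pairs(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)          // step 1, all i in parallel
        pairs[i] = a[2 * i] + a[2 * i + 1];
    std::vector<long> half = prefix_sums(pairs);     // step 2, the recursion
    std::vector<long> out(n);
    for (std::size_t i = 0; i < n / 2; ++i) {        // step 3, all i in parallel
        out[2 * i + 1] = half[i];                    // prefix over the first 2i+2 inputs
        out[2 * i]     = (i == 0) ? a[0] : half[i - 1] + a[2 * i];   // fill the gap
    }
    return out;
}
```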

Page 30: Parallel Analysis of Algorithms: PRAM + CGM

Analysis
p = n, T(n) = T(n/2) + O(1), T(1) = O(1)  =>  T(n) = O(log n)

efficient and NC, but not optimal

Page 31: Parallel Analysis of Algorithms: PRAM + CGM

Improving through rescheduling

set p = n / log n and simulate the previous algorithm

Page 32: Parallel Analysis of Algorithms: PRAM + CGM


Page 33: Parallel Analysis of Algorithms: PRAM + CGM

Analysis

# steps in phase i : (n / 2^i) / (n / log n) = log n / 2^i

T = O( Σ_{1<=i<=log n} log n / 2^i ) = O( log n · Σ_{i>=1} 1/2^i ) = O( log n )

p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )

=> the algorithm is efficient & NC & optimal

Page 34: Parallel Analysis of Algorithms: PRAM + CGM

Problem 4: Sorting
Input: a1, a2, …, an
Output: a1, a2, …, an permuted into sorted order


Page 35: Parallel Analysis of Algorithms: PRAM + CGM

Bitonic Sorting (Batcher)

Unimodal sequence: 9 10 13 17 21 19 16 15
Bitonic sequence: a cyclic shift of a unimodal sequence, e.g. 16 15 9 10 13 17 21 19

Page 36: Parallel Analysis of Algorithms: PRAM + CGM

Properties of bitonic sequences

Let X = x1 x2 ... xn xn+1 xn+2 ... x2n be bitonic.
L(X) = y1 ... yn with yi = min { xi, xn+i }
U(X) = z1 ... zn with zi = max { xi, xn+i }

(1) L(X) and U(X) are bitonic.
(2) Every element of L(X) is smaller than every element of U(X).

Page 37: Parallel Analysis of Algorithms: PRAM + CGM

Bitonic Merge: sorting a bitonic sequence

A bitonic sequence of length n can be sorted in O(log n) time using p = n processors (split into L(X) and U(X), then recurse on both halves in parallel).

Page 38: Parallel Analysis of Algorithms: PRAM + CGM

Sorting an arbitrary sequence a1, a2, …, an:
• split a1, a2, …, an into two sub-sequences: a1, …, an/2 and a(n/2)+1, a(n/2)+2, …, an
• recursively, in parallel, sort each sub-sequence using p/2 processors
• merge the two sorted sub-sequences into one sorted sequence using bitonic merge

Note: if X and Y are sorted sequences (increasing order), then X · Y^R (X followed by Y reversed) is a bitonic sequence (see the sketch below).
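A compact sequential rendering of this scheme (sort both halves, reverse the second so that X · Y^R is bitonic, then bitonic-merge); on the PRAM each inner comparison loop is one parallel step. Assumes the length is a power of two and ascending order:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Bitonic merge: v[lo .. lo+len) is bitonic; afterwards it is sorted ascending.
void bitonic_merge(std::vector<int>& v, std::size_t lo, std::size_t len) {
    if (len <= 1) return;
    const std::size_t half = len / 2;
    for (std::size_t i = lo; i < lo + half; ++i)              // min/max split: L(X) and U(X)
        if (v[i] > v[i + half]) std::swap(v[i], v[i + half]);
    bitonic_merge(v, lo, half);                               // both halves are again bitonic
    bitonic_merge(v, lo + half, half);
}

// Bitonic sort (Batcher): sort both halves, reverse the second, merge the bitonic result.
void bitonic_sort(std::vector<int>& v, std::size_t lo, std::size_t len) {
    if (len <= 1) return;
    const std::size_t half = len / 2;
    bitonic_sort(v, lo, half);
    bitonic_sort(v, lo + half, half);
    std::reverse(v.begin() + lo + half, v.begin() + lo + len);   // X . Y^R is bitonic
    bitonic_merge(v, lo, len);
}
```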

Page 39: Parallel Analysis of Algorithms: PRAM + CGM

Analysis
p = n, T(n) = T(n/2) + O(log n), T(1) = O(1)  =>  T(n) = O(log^2 n)

efficient and NC, but not optimal

Page 40: Parallel Analysis of Algorithms: PRAM + CGM


So what about an SMP machine?
Is it a PRAM? EREW? CREW? CRCW?
How does OpenMP play into this?

Page 41: Parallel Analysis of Algorithms: PRAM + CGM


OpenMP/SMP = CREW PRAM, but coarse grained.
T(p) >= f·Ts + (1-f)·Ts / p, with f = the sequential fraction
T(n,p) = f·Ts + (sum over all parallel regions of the maximum thread time in that region’s fork)

[Figure: fork/join execution: a master thread repeatedly forking parallel regions.]
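A small OpenMP illustration of the cost model just stated: one sequential section (the f·Ts term) followed by a single parallel region whose cost is the time of its slowest thread. The array size and the use of a sum reduction are arbitrary choices for this sketch; compile with -fopenmp:

```cpp
#include <omp.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> a(n);

    // Sequential part (contributes to the f * Ts term): prepare the data.
    for (std::size_t i = 0; i < n; ++i) a[i] = 1.0 / double(i + 1);

    // One parallel region (fork/join): the master thread forks a team,
    // and the region costs as much as its slowest thread.
    double sum = 0.0;
    const double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < static_cast<long long>(n); ++i)
        sum += a[i];
    const double t1 = omp_get_wtime();

    std::printf("threads = %d, sum = %.6f, parallel region: %.4f s\n",
                omp_get_max_threads(), sum, t1 - t0);
    return 0;
}
```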

Page 42: Parallel Analysis of Algorithms: PRAM + CGM


Outline
• Parallel Analysis of Algorithms
• Models
  – Shared Memory (PRAM, SMP)
  – Distributed Memory (BSP, CGM)

Page 43: Parallel Analysis of Algorithms: PRAM + CGM

Distributed Memory Models

Page 44: Parallel Analysis of Algorithms: PRAM + CGM

Parallel Computing
p : # processors
n : problem size
Ts(n) : sequential time
T(p,n) : parallel time

speedup: S(p,n) = Ts(n) / T(p,n)

Goal: obtain linear speedup S(p,n) = p

Page 45: Parallel Analysis of Algorithms: PRAM + CGM

Parallel Computers

Beowulf cluster, Blue Gene/Q, Cray XK7, custom MPP (Tianhe-2), ...

Page 46: Parallel Analysis of Algorithms: PRAM + CGM

Parallel Machine Models
How to abstract the machine into a simplified model such that:
• algorithm/application design is not hampered by too many details
• calculated time complexity predictions match the actually observed running times (with sufficient accuracy)

Page 47: Parallel Analysis of Algorithms: PRAM + CGM

Parallel Machine Models
• PRAM
• Fine grained networks (array, ring, mesh, hypercube)
• Bulk Synchronous Parallelism (BSP), Valiant, 1990
• Coarse Grained Multicomputer (CGM), Dehne, Rau-Chaplin, 1993
• Multithreaded (Cilk), Leiserson, 1995
• many more ...

Page 48: Parallel Analysis of Algorithms: PRAM + CGM

PRAM
p = O(n) processors, massively parallel

[Figure: processors connected to a shared memory with cells 1 … n.]

Page 49: Parallel Analysis of Algorithms: PRAM + CGM

Example: PRAM Sort
Sort by repeated list merging:
• Bitonic Sort: O(log n) per merge => O(log^2 n)
• Cole: O(1) per merge => O(log n)

Page 50: Parallel Analysis of Algorithms: PRAM + CGM

Fine Grained Networks
p = O(n) processors, massively parallel

[Figure: a fixed interconnection network (e.g., a mesh) of p processors.]

Page 51: Parallel Analysis of Algorithms: PRAM + CGM

Example: Mesh Sort
O(n^(1/2)) time, by merging sorted sub-meshes

Page 52: Parallel Analysis of Algorithms: PRAM + CGM

Back to reality... Would anyone use a parallel machine with n processors in order to sort n items?

Of course NOT...

Typical parallel machines have large ratios n/p (e.g., n/p = 16M).

Page 53: Parallel Analysis of Algorithms: PRAM + CGM

Brent's Theorem
Mapping fine grained => coarse grained, via virtual processors: if we simulate n virtual processors on p real processors, then S(p) = S(n) · p/n.

So S(n) = O(n) ("optimal") => S(p) = O(p) ("optimal"). (A sketch of the simulation follows.)
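The simulation behind Brent's argument, as a sketch: each of the p real processors plays the role of n/p virtual processors in every synchronous step, so one step of the fine grained algorithm costs O(n/p) real time. The step callback and its signature are invented for this illustration:

```cpp
#include <cstddef>
#include <functional>

// Execute one synchronous step of n virtual processors on p real processors:
// real processor r handles virtual processors r, r + p, r + 2p, ...
// (the outer loop is conceptually parallel; here it is just iterated).
void simulate_step(std::size_t n, std::size_t p,
                   const std::function<void(std::size_t)>& step_of_virtual_proc) {
    for (std::size_t r = 0; r < p; ++r)              // the p real processors
        for (std::size_t v = r; v < n; v += p)       // each simulates n/p virtual ones
            step_of_virtual_proc(v);
}
```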

Page 54: Parallel Analysis of Algorithms: PRAM + CGM

The Problem!
• Fine grained PRAM and fixed network algorithms are VERY slow when implemented on commercial parallel machines.

Page 55: Parallel Analysis of Algorithms: PRAM + CGM

Why ?

Page 56: Parallel Analysis of Algorithms: PRAM + CGM

Why ?


Page 57: Parallel Analysis of Algorithms: PRAM + CGM

Why ?

The assumption is not true: in most cases, S(n) is NOT optimal.

[Figure: for fine grained mesh sort with p = n, S(n) = (n log n) / n^(1/2), far below the optimal S(n) = n.]

Page 58: Parallel Analysis of Algorithms: PRAM + CGM

CGM: Coarse Grained Multicomputer
Dehne, Rau-Chaplin, 1993

[Figure: speedup S(p) considered in the coarse grained range p << n rather than at p = n.]

Page 59: Parallel Analysis of Algorithms: PRAM + CGM

CGM (Coarse Grained Multicomputer)
• coarse grained memory
• coarse grained computation
• coarse grained communication

Page 60: Parallel Analysis of Algorithms: PRAM + CGM

Coarse Grained Memory

Ignore small n/p

e.g. assume n/p > p

[Figure: p processors, each with a local memory holding n/p data items, connected by a network or shared memory.]

Page 61: Parallel Analysis of Algorithms: PRAM + CGM

Coarse Grained Computation

Compute in supersteps (rounds) with barrier synchronization (as in BSP).

[Figure: p processors computing in rounds 1, 2, 3, separated by barriers.]

Page 62: Parallel Analysis of Algorithms: PRAM + CGM

Coarse Grained Communication
• All communication steps are h-relations, h = O(n/p)
• No individual messages

[Figure: in each round the processors exchange data through one h-relation.]

Page 63: Parallel Analysis of Algorithms: PRAM + CGM

h-Relation

[Figure: in one communication round every processor sends and receives at most h = O(n/p) data.]
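In MPI terms, one h-relation round is typically a single personalized all-to-all exchange: every process contributes O(n/p) data and receives O(n/p) data. A minimal sketch with a fixed per-destination chunk (the chunk size here is an arbitrary stand-in for (n/p)/p):

```cpp
#include <mpi.h>
#include <vector>

// One CGM communication round realized as an h-relation:
// every process sends `chunk` ints to each of the p processes
// and receives the same amount, i.e. O(n/p) traffic in and out.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int p = 0, rank = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int chunk = 4;                            // stand-in for (n/p)/p items
    std::vector<int> sendbuf(p * chunk, rank);      // block i is destined for process i
    std::vector<int> recvbuf(p * chunk);

    MPI_Alltoall(sendbuf.data(), chunk, MPI_INT,
                 recvbuf.data(), chunk, MPI_INT, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```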

Page 64: Parallel Analysis of Algorithms: PRAM + CGM

CGM

Complexity measures:
• number of rounds (e.g., O(1), O(log p), …)
• scalability (e.g., n/p > p)
• local computation
• communication volume

Page 65: Parallel Analysis of Algorithms: PRAM + CGM

CGM
• coarse grained memory
• coarse grained computation
• coarse grained communication
=> practical parallel algorithms: efficient and portable

Page 66: Parallel Analysis of Algorithms: PRAM + CGM

Det. Sample Sort (CGM Algorithm)
• sort locally and create a p-sample

[Figure: each of the p processors holds n/p data items and its local p-sample.]

Page 67: Parallel Analysis of Algorithms: PRAM + CGM

Det. Sample Sort (CGM Algorithm)
• send all p-samples to processor 1


Page 68: Parallel Analysis of Algorithms: PRAM + CGM

Det. Sample Sort (CGM Algorithm)
• processor 1: sort all received samples and compute the global p-sample


Page 69: Parallel Analysis of Algorithms: PRAM + CGM

Det. Sample Sort (CGM Algorithm)
• broadcast the global p-sample
• bucket locally according to the global p-sample
• send bucket i to processor i
• resort locally
(An end-to-end MPI sketch of these steps follows.)
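An end-to-end sketch of the five steps in MPI; the exact p-sample selection (every (n/p)/p-th locally sorted element, and every p-th of the gathered p² samples as global splitters) is one common concrete choice, not prescribed by the slides, and int data with no error handling keeps the sketch short:

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// CGM deterministic sample sort (sketch); assumes n/p > p^2 as on the slides.
std::vector<int> cgm_sample_sort(std::vector<int> local, MPI_Comm comm) {
    int p = 0, rank = 0;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    // 1. Sort locally and create the local p-sample.
    std::sort(local.begin(), local.end());
    std::vector<int> sample(p);
    for (int i = 0; i < p; ++i)
        sample[i] = local[static_cast<std::size_t>(i) * local.size() / p];

    // 2. Send all p-samples to processor 0 ("processor 1" on the slides).
    std::vector<int> all_samples(rank == 0 ? p * p : 0);
    MPI_Gather(sample.data(), p, MPI_INT, all_samples.data(), p, MPI_INT, 0, comm);

    // 3. Processor 0 sorts the p^2 samples and picks the global p-sample (p-1 splitters).
    std::vector<int> splitters(p - 1);
    if (rank == 0) {
        std::sort(all_samples.begin(), all_samples.end());
        for (int i = 1; i < p; ++i) splitters[i - 1] = all_samples[i * p];
    }

    // 4. Broadcast the global p-sample and bucket the (sorted) local data.
    MPI_Bcast(splitters.data(), p - 1, MPI_INT, 0, comm);
    std::vector<int> sendcounts(p, 0), sdispls(p, 0);
    std::size_t pos = 0;
    for (int b = 0; b < p; ++b) {
        sdispls[b] = static_cast<int>(pos);
        while (pos < local.size() && (b == p - 1 || local[pos] < splitters[b])) ++pos;
        sendcounts[b] = static_cast<int>(pos) - sdispls[b];
    }

    // 5. One h-relation: send bucket i to processor i, then resort locally.
    std::vector<int> recvcounts(p), rdispls(p);
    MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);
    int total = 0;
    for (int i = 0; i < p; ++i) { rdispls[i] = total; total += recvcounts[i]; }
    std::vector<int> result(total);
    MPI_Alltoallv(local.data(), sendcounts.data(), sdispls.data(), MPI_INT,
                  result.data(), recvcounts.data(), rdispls.data(), MPI_INT, comm);
    std::sort(result.begin(), result.end());
    return result;
}
```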


Page 70: Parallel Analysis of Algorithms: PRAM + CGM

Det. Sample Sort (CGM Algorithm): analysis
• O(1) rounds for n/p > p^2
• O( (n/p) log n ) local computation
• Goodrich (FOCS'98): O(1) rounds for n/p > p^ε (constant ε > 0)
