
Winter 2014 Parallel Processing, Fundamental Concepts Slide 1

2 A Taste of Parallel Algorithms

Learn about the nature of parallel algorithms and complexity:
• By implementing 5 building-block parallel computations
• On 4 simple parallel architectures (20 combinations)

Topics in This Chapter

2.1 Some Simple Computations

2.2 Some Simple Architectures

2.3 Algorithms for a Linear Array

2.4 Algorithms for a Binary Tree

2.5 Algorithms for a 2D Mesh

2.6 Algorithms with Shared Variables

Winter 2014 Parallel Processing, Fundamental Concepts Slide 2

2.1 SOME SIMPLE COMPUTATIONS

Five fundamental building-block computations are:

1. Semigroup (reduction, fan-in) computation

2. Parallel prefix computation

3. Packet routing

4. Broadcasting, and its more general version, multicasting

5. Sorting records in ascending/descending order of their keys

Winter 2014 Parallel Processing, Fundamental Concepts Slide 3

2.1 SOME SIMPLE COMPUTATIONS

Semigroup Computation:

Let ⊗ be an associative binary operator; that is,

(x ⊗ y) ⊗ z = x ⊗ (y ⊗ z) for all x, y, z ∈ S. A semigroup is simply a pair (S, ⊗), where S is a set of elements on which ⊗ is defined.

Semigroup (also known as reduction or fan-in) computation is defined as: given a list of n values x0, x1, . . . , xn–1, compute x0 ⊗ x1 ⊗ . . . ⊗ xn–1.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 4

2.1 SOME SIMPLE COMPUTATIONS

Semigroup Computation:

Common examples for the operator ⊗ include +, ×, ∧, ∨, ⊕, ∩, ∪, max, and min.

The operator ⊗ may or may not be commutative, i.e., it may or may not satisfy x ⊗ y = y ⊗ x (all of the above examples are commutative, but the carry computation, e.g., is not).

While the parallel algorithm can compute chunks of the expression using any partitioning scheme, the chunks must eventually be combined in left-to-right order (a sketch of this follows).
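As a concrete illustration of ordered combining (our example, not from the slides; the helper name chunked_reduce is hypothetical), here is a minimal Python sketch:

```python
from functools import reduce

def chunked_reduce(op, values, num_chunks):
    # Reduce each chunk independently (conceptually in parallel), then
    # combine the partial results strictly left to right, so the scheme
    # stays correct even when op is associative but not commutative.
    size = -(-len(values) // num_chunks)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(op, c) for c in chunks]  # parallelizable step
    return reduce(op, partials)                 # ordered combination

# String concatenation is associative but not commutative:
print(chunked_reduce(lambda a, b: a + b, list("parallel"), 3))  # parallel
```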

Winter 2014 Parallel Processing, Fundamental Concepts Slide 5

2.1 SOME SIMPLE COMPUTATIONS

Parallel Prefix Computation:

With the same assumptions as in the semigroup computation, a parallel prefix computation is defined as simultaneously evaluating all prefixes of the expression x0 ⊗ x1 ⊗ . . . ⊗ xn–1; the outputs are x0, x0 ⊗ x1, x0 ⊗ x1 ⊗ x2, . . . , x0 ⊗ x1 ⊗ . . . ⊗ xn–1.

Note that the ith prefix expression is si = x0 ⊗ x1 ⊗ . . . ⊗ xi. The comment about commutativity, or lack thereof, of the binary operator ⊗ applies here as well.
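For reference, a minimal sequential Python sketch of this scan (our code; prefix_scan is a name we chose):

```python
def prefix_scan(op, xs):
    # Inclusive scan: out[i] = x0 (op) x1 (op) ... (op) xi.
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

print(prefix_scan(lambda a, b: a + b, [5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [5, 7, 15, 21, 24, 31, 40, 41, 45]
```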

Winter 2014 Parallel Processing, Fundamental Concepts Slide 6

2.1 SOME SIMPLE COMPUTATIONS

[Figure: a running result s, initialized to the identity element, absorbs x0, x1, . . . , xn–1 one operand per time step, from t = 0 through t = n.]

Fig. 2.1 Semigroup computation on a uniprocessor.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 7

2.1 SOME SIMPLE COMPUTATIONS

Packet Routing:

A packet of information resides at Processor i and must be sent to Processor j. The problem is to route the packet through intermediate processors, if needed, such that it gets to the destination as quickly as possible.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 8

2.1 SOME SIMPLE COMPUTATIONS

Broadcasting

Given a value a known at a certain processor i, disseminate it to all p processors as quickly as possible, so that at the end, every processor has access to, or “knows,” the value. This is sometimes referred to as one-to-all communication.

The more general case of this operation, i.e., one-to-many communication, is known as multicasting.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 9

2.1 SOME SIMPLE COMPUTATIONS

Sorting:

Rather than sorting a set of records, each with a key and data elements, we focus on sorting a set of keys for simplicity. Our sorting problem is thus defined as: given a list of n keys x0, x1, . . . , xn–1, and a total order ≤ on key values, rearrange the n keys in nondescending order.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 10

2.2 SOME SIMPLE ARCHITECTURES

We define four simple parallel architectures:

1. Linear array of processors

2. Binary tree of processors

3. Two-dimensional mesh of processors

4. Multiple processors with shared variables

Winter 2014 Parallel Processing, Fundamental Concepts Slide 11

2.2 SOME SIMPLE ARCHITECTURES

1. Linear array of processors:

Fig. 2.2 A linear array of nine processors and its ring variant.

[Figure: processors P0 through P8 in a line; the ring variant adds a wraparound link from P8 back to P0.]

Max node degree d = 2
Network diameter D = p – 1 (ring: ⌊p/2⌋)
Bisection width B = 1 (ring: 2)

Winter 2014 Parallel Processing, Fundamental Concepts Slide 12

2.2 SOME SIMPLE ARCHITECTURES

2. Binary Tree

[Figure: a balanced binary tree of nine processors, P0–P8.]

Max node degree d = 3
Network diameter D ≈ 2 log2 p
Bisection width B = 1

Winter 2014 Parallel Processing, Fundamental Concepts Slide 13

2.2 SOME SIMPLE ARCHITECTURES

3. 2D Mesh

Fig. 2.4 2D mesh of 9 processors and its torus variant.

[Figure: a 3 × 3 grid of processors P0–P8; the torus variant adds wraparound links in every row and column.]

Max node degree d = 4
Network diameter D = 2√p – 2 (torus: √p)
Bisection width B = √p (torus: 2√p)

Winter 2014 Parallel Processing, Fundamental Concepts Slide 14

2.2 SOME SIMPLE ARCHITECTURES

4. Shared Memory

Fig. 2.5 A shared-variable architecture modeled as a complete graph.

Costly to implement, not scalable . . .
but conceptually simple and easy to program.

[Figure: nine processors P0–P8, each directly connected to all of the others (a complete graph).]

Max node degree d = p – 1
Network diameter D = 1
Bisection width B = ⌊p/2⌋ ⋅ ⌈p/2⌉

Winter 2014 Parallel Processing, Fundamental Concepts Slide 15

2.3 ALGORITHMS FOR A LINEAR ARRAY

1. Semigroup Computation:

Fig. 2.6 Maximum-finding on a linear array of nine processors.

          P0 P1 P2 P3 P4 P5 P6 P7 P8
Initial:   5  2  8  6  3  7  9  1  4
           5  8  8  8  7  9  9  9  4
           8  8  8  8  9  9  9  9  9
           8  8  8  9  9  9  9  9  9
           8  8  9  9  9  9  9  9  9
           8  9  9  9  9  9  9  9  9
Maximum:   9  9  9  9  9  9  9  9  9

For general semigroup computation:
Phase 1: The partial result is propagated from left to right.
Phase 2: The result obtained by processor p – 1 is broadcast leftward.
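A minimal Python emulation of the two phases (ours); it returns the list of values held by the p processors at the end:

```python
def linear_array_semigroup(op, vals):
    # Phase 1: the partial result sweeps left to right (p - 1 steps).
    acc = vals[0]
    for v in vals[1:]:
        acc = op(acc, v)
    # Phase 2: processor p - 1 broadcasts the result leftward
    # (p - 1 more steps); every processor ends up holding it.
    return [acc] * len(vals)

print(linear_array_semigroup(max, [5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [9, 9, 9, 9, 9, 9, 9, 9, 9]
```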

Winter 2014 Parallel Processing, Fundamental Concepts Slide 16

2.3 ALGORITHMS FOR A LINEAR ARRAY

2. Parallel Prefix Computation:

Fig. 2.8 Computing prefix sums on a linear array with two items per processor.

                         P0 P1 P2 P3 P4 P5 P6 P7 P8
Initial values:           5  2  8  6  3  7  9  1  4
                          1  6  3  2  5  3  6  7  5
Local prefixes:           6  8 11  8  8 10 15  8  9
Diminished prefixes:   +  0  6 14 25 33 41 51 66 74
Final results:         =  5  8 22 31 36 48 60 67 78
                          6 14 25 33 41 51 66 74 83
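The figure's numbers can be reproduced with this short Python sketch (ours), taking ⊗ to be addition:

```python
def prefix_sums_two_per_processor(upper, lower):
    # Processor i holds the pair (upper[i], lower[i]).
    local = [u + l for u, l in zip(upper, lower)]   # local prefixes
    dim, acc = [], 0
    for s in local:                                  # diminished prefixes
        dim.append(acc)                              # (exclusive scan)
        acc += s
    top = [d + u for d, u in zip(dim, upper)]        # final results
    bottom = [d + s for d, s in zip(dim, local)]
    return top, bottom

print(prefix_sums_two_per_processor([5, 2, 8, 6, 3, 7, 9, 1, 4],
                                    [1, 6, 3, 2, 5, 3, 6, 7, 5]))
# ([5, 8, 22, 31, 36, 48, 60, 67, 78], [6, 14, 25, 33, 41, 51, 66, 74, 83])
```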

Winter 2014 Parallel Processing, Fundamental Concepts Slide 17

2.3 ALGORITHMS FOR A LINEAR ARRAY

3. Packet Routing:

To send a packet of information from Processor i to Processor j on a linear array, we simply attach a routing tag with the value j – i to it.

The sign of a routing tag determines the direction in which it should move (+ = right, – = left), while its magnitude indicates the action to be performed (0 = remove the packet, nonzero = forward the packet). With each forwarding, the magnitude of the routing tag is decremented by 1.
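A minimal Python trace of one packet under this rule (ours):

```python
def route_packet(i, j):
    # Attach routing tag j - i; the sign gives the direction, the
    # magnitude the remaining hop count; it drops by 1 per forwarding.
    tag, pos, trace = j - i, i, [i]
    while tag != 0:
        step = 1 if tag > 0 else -1
        pos += step
        tag -= step
        trace.append(pos)
    return trace  # tag == 0: the packet is removed at its destination

print(route_packet(2, 6))  # [2, 3, 4, 5, 6]
print(route_packet(6, 2))  # [6, 5, 4, 3, 2]
```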

Winter 2014 Parallel Processing, Fundamental Concepts Slide 18

2.3 ALGORITHMS FOR A LINEAR ARRAY

4. Broadcasting:

If Processor i wants to broadcast a value a to all processors, it sends an rbcast(a) (read "r-broadcast") message to its right neighbor and an lbcast(a) message to its left neighbor.

Any processor receiving an rbcast(a) message simply copies the value a and forwards the message to its right neighbor (if any). Similarly, receiving an lbcast(a) message causes a to be copied locally and the message forwarded to the left neighbor.

The worst-case number of communication steps for broadcasting is p – 1.
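A one-line Python check of that bound (ours): the two broadcast waves travel concurrently, so the time from source i is the larger of its two distances to the array ends:

```python
def broadcast_steps(p, i):
    # lbcast travels i hops, rbcast travels p - 1 - i hops, in parallel.
    return max(i, p - 1 - i)

print([broadcast_steps(9, i) for i in range(9)])
# [8, 7, 6, 5, 4, 5, 6, 7, 8]  -> worst case p - 1 at the endpoints
```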

Winter 2014 Parallel Processing, Fundamental Concepts Slide 19

2.3 ALGORITHMS FOR A LINEAR ARRAY

5. Sorting:

Fig. 2.9 Sorting on a linear array with the keys input sequentially from the left.

[Figure: the keys 5, 2, 8, 6, 3, 7, 9, 1, 4 enter the array from the left, one per step; each processor keeps the smaller of the key it holds and the key arriving from the left, passing the larger one to the right, until the array holds 1 2 3 4 5 6 7 8 9 in sorted order.]

Winter 2014 Parallel Processing, Fundamental Concepts Slide 20

2.3 ALGORITHMS FOR A LINEAR ARRAY

Fig. 2.10 Odd-even transposition sort on a linear array.

          P0 P1 P2 P3 P4 P5 P6 P7 P8
Initial:   5  2  8  6  3  7  9  1  4
Step 1:    5  2  8  3  6  7  9  1  4
Step 2:    2  5  3  8  6  7  1  9  4
Step 3:    2  3  5  6  8  1  7  4  9
Step 4:    2  3  5  6  1  8  4  7  9
Step 5:    2  3  5  1  6  4  8  7  9
Step 6:    2  3  1  5  4  6  7  8  9
Step 7:    2  1  3  4  5  6  7  8  9
Step 8:    1  2  3  4  5  6  7  8  9

In odd steps 1, 3, 5, etc., odd-numbered processors exchange values with their right neighbors; in even steps, even-numbered processors do the same (a code sketch follows).

T(1) = W(1) = p log2 p
T(p) = p, W(p) ≈ p²/2
S(p) = log2 p, R(p) = p/(2 log2 p)
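A sequential Python emulation of odd-even transposition sort (ours); odd-indexed pairs go first here, matching the trace above:

```python
def odd_even_transposition_sort(keys):
    a, p = list(keys), len(keys)
    for t in range(p):                  # p steps suffice
        start = 1 if t % 2 == 0 else 0  # alternate odd/even pairings
        for i in range(start, p - 1, 2):
            if a[i] > a[i + 1]:         # compare-exchange with right neighbor
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```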

Winter 2014 Parallel Processing, Fundamental Concepts Slide 21

2.4 ALGORITHMS FOR A BINARY TREE

1. Semigroup Computation

[Figure: values combine pairwise as they move up the nine-processor binary tree; the root then broadcasts the result back down to all nodes.]

Reduction computation and broadcasting on a binary tree.
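A minimal Python sketch of the upward reduction (ours); pairing preserves left-to-right order, so non-commutative operators remain safe, and a downward broadcast then delivers the result to every node:

```python
def tree_reduce(op, values):
    level = list(values)
    while len(level) > 1:              # one tree level per parallel step
        nxt = [op(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:             # an unpaired value passes through
            nxt.append(level[-1])
        level = nxt
    return level[0]                    # root's value, then broadcast down

print(tree_reduce(max, [5, 2, 8, 6, 3, 7, 9, 1, 4]))  # 9
```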

Winter 2014 Parallel Processing, Fundamental Concepts Slide 22

2.4 ALGORITHMS FOR A BINARY TREE

2. Parallel Prefix Computation

Fig. 2.11 Scan computation on a binary tree of processors.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 23

2.4 ALGORITHMS FOR A BINARY TREE

Some applications of the parallel prefix computation

Ranks of 1s in a list of 0s/1s:
Data:         0 0 1 0 1 0 0 1 1 1 0
Prefix sums:  0 0 1 1 2 2 2 3 4 5 5
Ranks of 1s:  1 2 3 4 5 (the prefix-sum value at each 1)

Priority arbitration circuit:
Data:              0 0 1 0 1 0 0 1 1 1 0
Dim'd prefix ORs:  0 0 0 1 1 1 1 1 1 1 1
Complement:        1 1 1 0 0 0 0 0 0 0 0
AND with data:     0 0 1 0 0 0 0 0 0 0 0
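Both applications can be checked with a few lines of Python (ours), using the data row from the slide:

```python
data = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0]

# Ranks of 1s: the inclusive prefix sum at each 1 is that 1's rank.
prefix, acc = [], 0
for b in data:
    acc += b
    prefix.append(acc)
print([s for b, s in zip(data, prefix) if b])  # [1, 2, 3, 4, 5]

# Priority arbitration: the diminished (exclusive) prefix OR marks all
# positions after the first 1; complement and AND isolate that 1.
dim, acc = [], 0
for b in data:
    dim.append(acc)
    acc |= b
print([b & (1 - d) for b, d in zip(data, dim)])
# [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```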

Winter 2014 Parallel Processing, Fundamental Concepts Slide 24

2.4 ALGORITHMS FOR A BINARY TREE

3. Packet Routing

[Figure: the nine-processor binary tree with nodes numbered in preorder: each node's label precedes those of its left subtree, which precede those of its right subtree.]

Preorder indexing

if dest = self
then remove the packet
else if dest < self or dest > maxr
then route the packet upward
else if dest ≤ maxl
then route the packet leftward
else route the packet rightward
endif

(Here self is the node's own preorder label, and maxl and maxr are the largest preorder labels in its left and right subtrees, respectively.)
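A runnable Python version of the routing rule (ours):

```python
def tree_route_step(self_id, maxl, maxr, dest):
    if dest == self_id:
        return "remove"                 # packet has arrived
    if dest < self_id or dest > maxr:
        return "up"                     # destination outside this subtree
    return "left" if dest <= maxl else "right"

# At a root labeled 0 whose left subtree holds labels 1..4 and right
# subtree holds labels 5..8 (a hypothetical 9-node preorder tree):
print(tree_route_step(0, 4, 8, 7))  # right
print(tree_route_step(0, 4, 8, 3))  # left
```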

Winter 2014 Parallel Processing, Fundamental Concepts Slide 25

2.4 ALGORITHMS FOR A BINARY TREE

4. Sorting

if you have 2 items
then do nothing
else if you have 1 item that came from the left (right)
then get the smaller item from the right (left) child
else get the smaller item from each child
endif

Winter 2014 Parallel Processing, Fundamental Concepts Slide 26

2.4 ALGORITHMS FOR A BINARY TREE

Fig. 2.12 The first few steps of the sorting algorithm on a binary tree.

[Figure: panels (a) through (d) show the keys 5, 2, 3, 1, 4 during the first few steps; at each step, a node with fewer than two items requests and receives the smaller available item from below.]

Winter 2014 Parallel Processing, Fundamental Concepts Slide 27

2.5 ALGORITHMS FOR A 2D MESH

1. Semigroup Computation:

To perform a semigroup computation on a 2D mesh, do the semigroup computation in each row and then in each column.

For example, in finding the maximum of a set of p values, stored one per processor, the row maximums are computed first and made available to every processor in the row. Then column maximums are identified.

This takes 4√p – 4 steps on a p-processor square mesh.
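A sequential Python emulation of the row-then-column scheme (ours), with max as the operator:

```python
def mesh_semigroup(op, grid):
    # Row phase: every processor in a row learns its row's result.
    row_results = []
    for row in grid:
        acc = row[0]
        for v in row[1:]:
            acc = op(acc, v)
        row_results.append(acc)
    # Column phase over the row results gives the global value.
    acc = row_results[0]
    for v in row_results[1:]:
        acc = op(acc, v)
    return acc  # known to every processor after the column broadcast

print(mesh_semigroup(max, [[5, 2, 8], [6, 3, 7], [9, 1, 4]]))  # 9
```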

Winter 2014 Parallel Processing, Fundamental Concepts Slide 28

2.5 ALGORITHMS FOR A 2D MESH

2. Parallel Prefix Computation:

This can be done in three phases, assuming that the processors (and their stored values) are indexed in row-major order:

1. Do a parallel prefix computation on each row.
2. Do a diminished parallel prefix computation in the rightmost column.
3. Broadcast the results in the rightmost column to all of the elements in the respective rows and combine with the initially computed row prefix value (see the sketch below).
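A minimal Python sketch of the three phases for row-major prefix sums (ours):

```python
def mesh_prefix_sums(grid):
    # Phase 1: prefix sums within each row.
    scans = []
    for row in grid:
        acc, scan = 0, []
        for v in row:
            acc += v
            scan.append(acc)
        scans.append(scan)
    # Phase 2: diminished scan of the row totals (rightmost column).
    carries, carry = [], 0
    for scan in scans:
        carries.append(carry)
        carry += scan[-1]
    # Phase 3: broadcast each carry along its row and combine.
    return [[c + s for s in scan] for c, scan in zip(carries, scans)]

print(mesh_prefix_sums([[5, 2, 8], [6, 3, 7], [9, 1, 4]]))
# [[5, 7, 15], [21, 24, 31], [40, 41, 45]]
```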

Winter 2014 Parallel Processing, Fundamental Concepts Slide 29

2.5 ALGORITHMS FOR A 2D MESH

3. Packet Routing:
To route a data packet from the processor in Row r, Column c, to the processor in Row r′, Column c′, we first route it within Row r to Column c′. Then, we route it in Column c′ from Row r to Row r′.

4. Broadcasting:
Broadcasting is done in two phases: (1) broadcast the packet to every processor in the source node's row and (2) broadcast in all columns.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 30

2.5 ALGORITHMS FOR A 2D MESH

5. Sorting:

Initial    Snake-like   Top-to-bottom   Snake-like   Top-to-bottom   Left-to-right
values     row sort     column sort     row sort     column sort     row sort

5 2 8      2 5 8        1 4 3           1 3 4        1 3 2           1 2 3
6 3 7      7 6 3        2 5 8           8 5 2        6 5 4           4 5 6
9 1 4      1 4 9        7 6 9           6 7 9        8 7 9           7 8 9

(Phase 1: first row sort + column sort; Phase 2: second row sort + column sort; Phase 3: final row sort.)

Fig. 2.14 The shearsort algorithm on a 3 × 3 mesh.
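A compact sequential emulation of shearsort (ours): ⌈log2 k⌉ + 1 rounds of snake-like row sort plus top-to-bottom column sort, finished by a left-to-right row sort:

```python
import math

def shearsort(grid):
    k = len(grid)
    for _ in range(math.ceil(math.log2(k)) + 1):
        for r in range(k):                 # snake-like row sort
            grid[r].sort(reverse=(r % 2 == 1))
        for c in range(k):                 # top-to-bottom column sort
            col = sorted(grid[r][c] for r in range(k))
            for r in range(k):
                grid[r][c] = col[r]
    for r in range(k):                     # final left-to-right row sort
        grid[r].sort()
    return grid

print(shearsort([[5, 2, 8], [6, 3, 7], [9, 1, 4]]))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```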


Winter 2014 Parallel Processing, Fundamental Concepts Slide 32

2.6 ALGORITHMS WITH SHARED VARIABLES

1. Semigroup Computation:
Each processor obtains the data items from all other processors and performs the semigroup computation independently. Obviously, all processors will end up with the same result.

2. Parallel Prefix Computation:
Similar to the semigroup computation, except that each processor only obtains data items from processors with smaller indices.

3. Packet Routing:
Trivial in view of the direct communication path between any pair of processors.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 33

2.6 ALGORITHMS WITH SHARED VARIABLES

4. Broadcasting:
Trivial, as each processor can send a data item to all processors directly. In fact, because of this direct access, broadcasting is not needed; each processor already has access to any data item when needed.

5. Sorting:
The algorithm to be described for sorting with shared variables consists of two phases: ranking and data permutation. Ranking consists of determining the relative order of each key in the final sorted list.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 34

2.6 ALGORITHMS WITH SHARED VARIABLES

If each processor holds one key, then once the ranks are determined, the jth-ranked key can be sent to Processor j in the data permutation phase, requiring a single parallel communication step. Processor i is responsible for ranking its own key xi.

This is done by comparing xi to all other keys and counting the number of keys that are smaller than xi. In the case of equal key values, processor indices are used to establish the relative order.
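A minimal Python sketch of the two phases (ours), with ties broken by processor index as described:

```python
def rank_sort(keys):
    # Ranking phase: processor i counts keys smaller than keys[i]
    # (conceptually all i in parallel), using index j < i to break ties.
    n = len(keys)
    out = [None] * n
    for i in range(n):
        rank = sum(1 for j in range(n)
                   if keys[j] < keys[i] or (keys[j] == keys[i] and j < i))
        out[rank] = keys[i]  # permutation: jth-ranked key -> Processor j
    return out

print(rank_sort([5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```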


Winter 2014 Parallel Processing, Fundamental Concepts Slide 35

THE END