
Posted on 21-Dec-2015


Page 1: Parallel sorting…

Parallel sorting…

• n data values.
• Sequential algorithms are ~ O(n log n), except in special cases.
• Motivation:
  • As n grows huge, memory must be distributed.
  • We want some parallel speedup.
• On p processes, nlocal = nglobal / nprocs.
• O(nglobal (log nglobal + 1)) is possible with a parallel sort.

Naive solution…

• Distributed sequential sort method.
• Still ~ O(nglobal log nglobal), plus communication cost:
  • As n grows huge this gets painful, because many short messages are passed.
  • No parallel speedup.
  • Optimal for memory space.
• Consider a lightly optimized bubble sort: O(n^2).
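As a point of reference, a "lightly optimized" bubble sort might look like this — a sketch, not code from the slides; the function name is illustrative:

```c
#include <stddef.h>

/* Lightly optimized bubble sort: O(n^2) worst case, but each pass
 * shrinks the scanned range to the position of the last swap, and
 * the loop exits early once a pass makes no swaps at all. */
void bubble_sort(int *a, size_t n)
{
    while (n > 1) {
        size_t last_swap = 0;              /* index of the last swap this pass */
        for (size_t i = 1; i < n; i++) {
            if (a[i - 1] > a[i]) {
                int tmp = a[i - 1];
                a[i - 1] = a[i];
                a[i] = tmp;
                last_swap = i;
            }
        }
        n = last_swap;  /* everything at or beyond last_swap is already sorted */
    }
}
```

On nearly sorted input this exits after one or two passes, but the O(n^2) worst case remains.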

Page 2: Distributed sequential sort…

Distributed sequential sort…

[Diagram: the letters "S O R T I N G E X A M P L E O N E A" are partitioned across processes; each process sorts its partition locally, then neighboring processes repeatedly compare/swap boundary values (re-sorting locally after each exchange) until no further swaps occur and the sequence is globally sorted. DONE.]

Page 3: Odd-even sort…

Odd-even sort…

[Diagram: odd-even sort of the letters "S O R T I N G E X A M P L E O N E A", partitioned over six processes p0…p5.]

• Partition the data over the processes and sort data[0,…,nlocal-1] locally on each pe.
• Each iteration then has two phases:
  • Odd phase: odd-ranked pe's send their block left; the even-ranked receivers merge the two sorted blocks and return the high half.
  • Even phase: even-ranked pe's send their block left; the odd-ranked receivers merge and return the high half.
• DONE after ceil(nprocs/2) iterations (the six-process example finishes in three iterations).
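Ignoring the message passing, the scheme above can be simulated in one address space — a sketch with illustrative names (merge_split, odd_even_block_sort), where qsort stands in for a proper linear-time merge:

```c
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Merge two sorted blocks of m ints: `lo` keeps the low half,
 * `hi` keeps the high half (what a pe pair does after an exchange). */
static void merge_split(int *lo, int *hi, int m)
{
    int *tmp = malloc(2 * m * sizeof(int));
    memcpy(tmp, lo, m * sizeof(int));
    memcpy(tmp + m, hi, m * sizeof(int));
    qsort(tmp, 2 * m, sizeof(int), cmp_int);   /* stand-in for a linear merge */
    memcpy(lo, tmp, m * sizeof(int));
    memcpy(hi, tmp + m, m * sizeof(int));
    free(tmp);
}

/* data: p contiguous blocks of m ints, one block per simulated process. */
void odd_even_block_sort(int *data, int p, int m)
{
    for (int pe = 0; pe < p; pe++)                    /* local sort first */
        qsort(data + pe * m, m, sizeof(int), cmp_int);

    int iters = (p + 1) / 2;                          /* ceil(p/2) iterations */
    for (int it = 0; it < iters; it++) {
        for (int pe = 1; pe < p; pe += 2)             /* odd phase: odds pair left */
            merge_split(data + (pe - 1) * m, data + pe * m, m);
        for (int pe = 2; pe < p; pe += 2)             /* even phase: evens pair left */
            merge_split(data + (pe - 1) * m, data + pe * m, m);
    }
}
```

Each iteration runs both phases, so ceil(nprocs/2) iterations give the ~nprocs compare-exchange rounds that odd-even transposition needs in the worst case.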

Page 4: Odd-even sort… (subtleties)

Odd-even sort… (subtleties)

• Allocate contiguous memory for the merge:

  int *data, *buffer;
  data = (int*)calloc(2*nlocal, sizeof(int)); // 2*nlocal ints allocated; data[0,…,nlocal-1] is significant
  buffer = &data[nlocal];                     // buffer points to the second half of the data[] array

• Need to figure out the left and right neighbors:

  left_neighbor  = mype - 1; // mype will send nlocal ints from &data[0] to left_neighbor's &buffer[0]
  right_neighbor = mype + 1; // mype will send nlocal ints from &buffer[0] to right_neighbor's &data[0]

• BUT…

• …the left-most pe (mype == 0) has nowhere to send the low values in data[0,…,nlocal-1], since it has no left neighbor.

• This pe's left_neighbor should be set to MPI_PROC_NULL. Sending to MPI_PROC_NULL returns immediately without doing anything.

• AND…

• …the right-most pe (mype == nprocs-1) has nowhere to return the high data pointed to by buffer, because it has no right neighbor.

• This pe should send the high values in buffer[0,…,nlocal-1] to MPI_PROC_NULL, and EITHER:

  • when initializing data[ ], also pad buffer[0,…,nlocal-1] with INT_MAX (assuming we're sorting ints) so that the merge step operates correctly,

• OR:

  • this pe should not execute the merge step of the iteration.
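The layout and padding subtleties can be sketched without any MPI calls — illustrative function names; in the real code the neighbor's block would arrive via an MPI exchange rather than being filled locally:

```c
#include <limits.h>
#include <stdlib.h>

/* Lay out data[] and buffer[] contiguously so one 2*nlocal-int region
 * holds both halves. Returns the data pointer; *buffer_out points at
 * its second half. On the right-most pe the buffer is padded with
 * INT_MAX so the merge step still works with no right neighbor. */
int *alloc_sort_buffers(int nlocal, int is_rightmost, int **buffer_out)
{
    int *data = calloc(2 * (size_t)nlocal, sizeof(int));
    int *buffer = &data[nlocal];             /* second half of data[] */
    if (is_rightmost)
        for (int i = 0; i < nlocal; i++)
            buffer[i] = INT_MAX;             /* pads merge with "no data" */
    *buffer_out = buffer;
    return data;
}

/* Merge sorted data[0..nlocal-1] with sorted buffer[0..nlocal-1],
 * keeping only the low nlocal values in data[] (the "keep low" side). */
void merge_keep_low(int *data, const int *buffer, int nlocal)
{
    int *out = malloc(nlocal * sizeof(int));
    int i = 0, j = 0;
    for (int k = 0; k < nlocal; k++)
        out[k] = (j >= nlocal || (i < nlocal && data[i] <= buffer[j]))
                     ? data[i++] : buffer[j++];
    for (int k = 0; k < nlocal; k++)
        data[k] = out[k];
    free(out);
}
```

With the INT_MAX padding, the right-most pe's merge is a no-op on its own data, which is exactly why the padding alternative works.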

Page 5: Shearsort…

Shearsort…

• We have seen odd/even sorting with the processes treated as a 1D array.

• How about an odd-even sort on a 2D array of processes?

• A big speed-up requires a special ordering of the processes:

 0  1  2  3
 7  6  5  4
 8  9 10 11
15 14 13 12

(Smallest number at position 0, top-left; largest number at position 15, bottom-left.)

Page 6: Shearsort…

Shearsort…

• For i = 1, 2, 3, …, log(n)+1 (log base 2; n is the total number of values, so 16 values need log(16)+1 = 5 phases):

  • If i is odd:
    • sort even-numbered rows (2, 4, …) biggest at left, smallest at right
    • sort odd-numbered rows (1, 3, …) smallest at left, biggest at right

  • If i is even:
    • sort all columns so the smallest number is at the top and the biggest at the bottom

Initial:
S O R T
E G N I
X A M P
N O E L

i=1 (sort rows):
O R S T
N I G E
A M P X
O N L E

i=2 (sort columns):
A I G E
N M L E
N N P T
O R S X

i=3 (sort rows):
A E G I
N M L E
N N P T
X S R O

i=4 (sort columns):
A E G E
N M L I
N N P O
X S R T

i=5 (sort rows):
A E E G
N M L I
N N O P
X T S R

Done — reading the result in snake order gives A E E G I L M N N N O P R S T X.
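The phase schedule above can be checked with a small sequential sketch (illustrative; it sorts an n×n int grid in place, whereas in the parallel setting each row/column sort would itself be a distributed sort across a row or column of pe's):

```c
#include <stdlib.h>

static int cmp_asc(const void *a, const void *b)  { return *(const int *)a - *(const int *)b; }
static int cmp_desc(const void *a, const void *b) { return *(const int *)b - *(const int *)a; }

/* Shearsort an n x n grid (row-major, n a power of two): alternate
 * snake-order row sorts with column sorts for log2(n*n)+1 phases.
 * The result ends up sorted in snake order (row 0 left-to-right,
 * row 1 right-to-left, and so on). */
void shearsort(int *g, int n)
{
    int phases = 1, t = n * n;
    while (t > 1) { t >>= 1; phases++; }       /* log2(n*n) + 1 phases */

    int *col = malloc(n * sizeof(int));
    for (int i = 1; i <= phases; i++) {
        if (i % 2 == 1) {                      /* odd phase: sort rows (snake) */
            for (int r = 0; r < n; r++)
                qsort(g + r * n, n, sizeof(int),
                      (r % 2 == 0) ? cmp_asc : cmp_desc);
        } else {                               /* even phase: sort columns */
            for (int c = 0; c < n; c++) {
                for (int r = 0; r < n; r++) col[r] = g[r * n + c];
                qsort(col, n, sizeof(int), cmp_asc);
                for (int r = 0; r < n; r++) g[r * n + c] = col[r];
            }
        }
    }
    free(col);
}
```

Each row+column pair of phases at least halves the number of unsorted ("dirty") rows, which is why log2(n*n)+1 phases suffice.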

Page 7: Other sorts…

Other sorts…

• Bucketsort:
  • Pe's partition their data into small buckets.
  • Pe's send the appropriate chunks to "large buckets", one on each pe.
  • Pe's sort their "large buckets".
  • (Optional: a preprocessing stage where the master collects information on the data distribution and assigns bucket ranges to the slaves. This deals with non-uniform distributions.)

• Parallel mergesort: O(log^2 n)
  • Odd-even implementation of the standard mergesort.

• Parallel quicksort: O(n)
  • Master-slave: difficult to balance the sub-tasks.
  • Tree implementation: even harder to balance the tree.
  • A hypercube topology can be optimal!
  • Still O(n^2) worst case.
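A sequential sketch of the bucketsort idea (illustrative names; in the parallel version each "large bucket" lives on its own pe, and the scatter loop below is the message-passing step; assumes values in [0, max_val) and a roughly uniform distribution):

```c
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Bucketsort with p "large buckets" covering [0, max_val): element x
 * lands in bucket x*p/max_val. Buckets are sorted independently (on
 * their own pe's, in parallel) and concatenated back into data[]. */
void bucket_sort(int *data, int n, int p, int max_val)
{
    int *counts  = calloc(p, sizeof(int));
    int *offsets = calloc(p + 1, sizeof(int));
    int *next    = malloc(p * sizeof(int));
    int *out     = malloc(n * sizeof(int));

    for (int i = 0; i < n; i++)                 /* count bucket sizes */
        counts[(long)data[i] * p / max_val]++;
    for (int b = 0; b < p; b++)                 /* prefix sums -> bucket starts */
        offsets[b + 1] = offsets[b] + counts[b];
    for (int b = 0; b < p; b++) next[b] = offsets[b];

    for (int i = 0; i < n; i++)                 /* scatter ("send" to big buckets) */
        out[next[(long)data[i] * p / max_val]++] = data[i];

    for (int b = 0; b < p; b++)                 /* each pe sorts its bucket */
        qsort(out + offsets[b], counts[b], sizeof(int), cmp_int);

    for (int i = 0; i < n; i++) data[i] = out[i];
    free(counts); free(offsets); free(next); free(out);
}
```

Because the buckets partition the value range in order, the sorted buckets concatenate directly into a globally sorted array; the optional preprocessing stage in the slide would replace the uniform x*p/max_val rule with data-driven bucket boundaries.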