R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

Asynchronous Parallel Disk Sorting

Roman Dementiev and Peter Sanders

Max-Planck-Institut für Informatik, Saarbrücken, Germany

Thanks to: A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter

Motivation & Related Work

Sorting is important for external memory (EM) problems.
Disks are cheaper than computers.
Prefetch buffers for disk load balancing and overlapping have been studied, but there are no results that guarantee
– overlapping during merging of runs
– asymptotically optimal I/O volume

Here: algorithm & implementation
– I/O cost close to the lower bound
– guarantees almost perfect overlapping between I/O and computation

Motivation & Related Work

Theory and implementations: [BarvGrovVit’97], [BarveVitter’99], [SandEgnKorst’00], [HutchVitt’01], [HutchSandVitt’02]
[ChaudhryCormen’01’02]: implementation for distributed memory, parallel disks
Here: [DementievSanders’03]

Differences:
– Optimal prefetching algorithm
– Overlapping of I/O and computation
– Completely asynchronous implementation
– Part of the <stxxl> library, compatible with STL. Use is as simple as:

    stxxl::vector<my_type> v;
    long long int i = vec_size; // 'long long int' is a 64-bit data type
    while (i--)
        v.push_back(f(i)); // fill vector with some values
    stxxl::sort(v.begin(), v.end()); // sort
    // check order using STL predicate
    assert(std::is_sorted(v.begin(), v.end()));

External Multi-way Merge Sort

[Figure: the input is cut into chunks (chunk 0, chunk 1, …) of size O(M); each chunk is sorted into a run (run 0, run 1, …). A k-merger with k = O(M/B) buffers of size B merges groups of runs (runs 0–3, runs 4–7) into longer runs.]

These algorithms do not guarantee overlapping.

[Figure: merging on a D-head disk: D prefetch buffers, k merge buffers feeding the merger, and D write buffers holding D blocks of merged elements.]

Parameters of the D-head model:

N – input size

M – main memory size

B – block size

D – number of disks
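
The two phases named on this slide, run formation and repeated k-way merging, can be sketched in memory as follows. This is only an illustration under stated assumptions: everything lives in RAM, `M` and `k` are plain element counts standing in for the memory size and the merge fan-in, and the function names are hypothetical (they are not part of <stxxl>). No block I/O, buffering or overlapping is modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Run formation: cut the input into chunks of at most M elements, sort each.
// M stands in for the main memory size (assumption: counted in elements).
std::vector<std::vector<int>> form_runs(const std::vector<int>& input, std::size_t M) {
    std::vector<std::vector<int>> runs;
    for (std::size_t pos = 0; pos < input.size(); pos += M) {
        std::size_t end = std::min(input.size(), pos + M);
        std::vector<int> run(input.begin() + pos, input.begin() + end);
        std::sort(run.begin(), run.end());
        runs.push_back(std::move(run));
    }
    return runs;
}

// Merge sorted runs using a min-heap of (value, run index) pairs.
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> next(runs.size(), 0);  // read position per run
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});
    std::vector<int> out;
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        out.push_back(e.first);
        std::size_t r = e.second;
        if (++next[r] < runs[r].size()) heap.push({runs[r][next[r]], r});
    }
    return out;
}

// Merge groups of at most k runs per pass until a single run remains.
std::vector<int> multiway_merge_sort(const std::vector<int>& input,
                                     std::size_t M, std::size_t k) {
    assert(M >= 1 && k >= 2);
    std::vector<std::vector<int>> runs = form_runs(input, M);
    while (runs.size() > 1) {
        std::vector<std::vector<int>> merged;
        for (std::size_t i = 0; i < runs.size(); i += k) {
            std::size_t end = std::min(runs.size(), i + k);
            std::vector<std::vector<int>> group(runs.begin() + i, runs.begin() + end);
            merged.push_back(merge_runs(group));
        }
        runs = std::move(merged);
    }
    return runs.empty() ? std::vector<int>{} : runs[0];
}
```

Calling `multiway_merge_sort(v, 3, 2)` on an 11-element vector forms four runs and needs two merge passes; a larger `k` reduces the number of passes, which is why k = O(M/B) is chosen as large as memory allows.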

Multi-way Merge Sort with Overlapped I/Os

Theorem 1: If I/O and computation can be overlapped, N elements can be sorted in time

$$T_{\text{formruns}} + \log_{\Theta(M/B)} k' \cdot T_m$$

where

$$T_{\text{formruns}} = \max\left(k'\,T_{\text{sort}}(N/k'),\; N\,T_{I/O}\right) + T_{\text{startup}}$$

is the time needed for run formation,

$$T_m = \max\left(N\,T_{I/O},\; N\,T_{\text{intmerge}}\right) + T_{\text{startup}}$$

is the time needed for merging k = Θ(M/B) sequences, and k' = O(N/M) is the total number of runs.
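
To make the shape of the bound concrete, here is a small cost model. It assumes the bound reads as run formation time plus (number of merge passes) times the time of one merge pass, with each of the two terms being a maximum of compute time and I/O time plus a startup term. All field names are hypothetical, the inputs are made-up numbers in arbitrary time units, and rounding the pass count up to an integer is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical inputs to the model, all in the same (arbitrary) time unit.
struct SortModel {
    double N;           // number of elements
    double k;           // merge fan-in, k = Theta(M/B)
    double k_runs;      // k', number of initial runs
    double t_sort_run;  // T_sort(N/k'): time to sort one run internally
    double t_io;        // T_I/O: I/O time per element
    double t_intmerge;  // internal merging time per element
    double t_startup;   // T_startup
};

// max(k' * T_sort(N/k'), N * T_I/O) + T_startup
double run_formation_time(const SortModel& m) {
    return std::max(m.k_runs * m.t_sort_run, m.N * m.t_io) + m.t_startup;
}

// max(N * T_I/O, N * T_intmerge) + T_startup
double merge_pass_time(const SortModel& m) {
    return std::max(m.N * m.t_io, m.N * m.t_intmerge) + m.t_startup;
}

// Smallest integer p with k^p >= k' (assumption: passes are rounded up).
int merge_passes(const SortModel& m) {
    assert(m.k > 1.0);
    int passes = 0;
    for (double reach = 1.0; reach < m.k_runs; reach *= m.k) ++passes;
    return passes;
}

double total_sort_time(const SortModel& m) {
    return run_formation_time(m) + merge_passes(m) * merge_pass_time(m);
}
```

With both phases I/O bound, the total is dominated by N · T_I/O per pass, so the number of passes (and hence k) is what the model rewards.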

Overlapping Run Formation

Easy:

Corollary 2: An input of size N can be transformed into sorted runs of size ~M/2 in time

$$\max\left(\frac{2N}{M}\,T_{\text{sort}}(M/2),\; N\,T_{I/O}\right) + T_{\text{startup}}$$

(2M in the paper)

[Figure: timelines of the read, sort and write phases for runs 1, 2, 3, 4, …, k-1, k (run numbers), shown once for the I/O bound case and once for the compute bound case.]

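Simple External Merging

The smallest element of a block is its trigger.
Sort the pairs <trigger, block_id> to obtain the prediction sequence.

[Figure: k sorted runs feeding a k-merger; the prediction sequence gives the order in which the merger needs the blocks.]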
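
The prediction sequence described on this slide can be sketched as follows. The runs are held in memory here, `B` is the block size in elements, and the names `BlockId` and `prediction_sequence` are hypothetical. Ties between equal triggers are broken by run index, which is an assumption of this sketch rather than something the slide specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Identifies block number `block` of run number `run`.
struct BlockId {
    std::size_t run;
    std::size_t block;
};

// runs[r] must be sorted. Returns the blocks in the order in which a merger
// will need them, obtained by sorting <trigger, block_id> pairs by trigger,
// where the trigger is the smallest (first) element of the block.
std::vector<BlockId> prediction_sequence(const std::vector<std::vector<int>>& runs,
                                         std::size_t B) {
    assert(B > 0);
    struct Item {
        int trigger;
        BlockId id;
    };
    std::vector<Item> items;
    for (std::size_t r = 0; r < runs.size(); ++r)
        for (std::size_t b = 0; b * B < runs[r].size(); ++b)
            items.push_back(Item{runs[r][b * B], BlockId{r, b}});
    // stable_sort keeps the blocks of one run in their original order on ties
    std::stable_sort(items.begin(), items.end(),
                     [](const Item& x, const Item& y) { return x.trigger < y.trigger; });
    std::vector<BlockId> order;
    for (const Item& it : items) order.push_back(it.id);
    return order;
}
```

For the runs {1, 4, 7, 10} and {2, 3, 8, 9} with B = 2, the triggers are 1, 7 and 2, 8, so the blocks are needed in the order (run 0, block 0), (run 1, block 0), (run 0, block 1), (run 1, block 1).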

Overlapping I/O and Merging

Prediction of the delay between issuing two reads is not easy:

[Figure: k identical runs, each storing the sequence 1^(B-1) 2 3^(B-1) 4 5^(B-1) 6 …, feeding the merger.]

Solution:

[Figure: the same k runs after the merger has consumed all 1s; each run now starts 2 3^(B-1) 4 5^(B-1) 6 …]

[Figure: buffer layout on the D-head disk. First version: D prefetch buffers, k merge buffers, D write buffers (D blocks of merged elements). Second version: D prefetch buffers, k+3D overlap buffers, 2D write buffers.]

Overlapping I/O and Merging [contd.]

Theorem 4: Merging k sequences with a total of N' elements takes time

$$\max\left(\frac{2LN'}{DB},\; \ell N'\right) + T_{\text{startup}}$$

– ℓ: time to produce one element of output
– L: time to input or output D arbitrary blocks

Our I/O thread strategy:
– ≥ DB elements in write buffer => output step
– < DB elements in write buffer AND D overlap buffers avail. => input step

To prove Theorem 4 we distinguish two cases:
– I/O bound case: 2L ≥ DBℓ
– Compute bound case: 2L < DBℓ
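
The I/O thread rule and the case distinction can be written down as a small decision function and two helpers. This is a sketch under stated assumptions: the bound is read as max(2LN'/(DB), ℓN') + T_startup, the names are hypothetical, and the `Wait` outcome (neither condition holds) is an assumption, since the slide only lists the two steps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

enum class IoStep { Output, Input, Wait };

// Decide the next step of the I/O thread from the current buffer state.
// write_buf_elems: elements currently in the write buffer
// free_overlap_buffers: overlap buffers currently available
IoStep next_io_step(std::size_t write_buf_elems, std::size_t free_overlap_buffers,
                    std::size_t D, std::size_t B) {
    if (write_buf_elems >= D * B) return IoStep::Output;  // >= DB elements: write D blocks
    if (free_overlap_buffers >= D) return IoStep::Input;  // < DB and D buffers free: fetch
    return IoStep::Wait;                                  // assumption: otherwise idle
}

// Bound of Theorem 4 as read above: max(2 L N' / (D B), ell N') + T_startup.
double merge_time_bound(double n_elems, double L, double ell,
                        double D, double B, double t_startup) {
    return std::max(2.0 * L * n_elems / (D * B), ell * n_elems) + t_startup;
}

// I/O bound case: 2L >= D B ell; otherwise the compute bound case applies.
bool is_io_bound(double L, double ell, double D, double B) {
    return 2.0 * L >= D * B * ell;
}
```

The two cases are exactly the two arguments of the maximum: when 2L ≥ DBℓ the I/O term 2LN'/(DB) is the larger one, otherwise ℓN' is.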

Compute Bound Case

Lemma 6: In the compute bound case, after k/D+1 steps the merging thread never blocks until all elements are merged.

Notation: the number of elements in the write buffer, and r, the number of elements in the overlap and merge buffers.

Our I/O thread strategy:
– ≥ DB elements in write buffer => output step
– < DB elements in write buffer AND D overlap buffers avail. => input step

[Figure: state diagram plotting r (ticks kB, kB+DB, kB+2DB, kB+3DB) against the write buffer fill (ticks 0, DB, 2DB-L/ℓ, 2DB), with fetch and output steps and the regions labeled "merge thread blocked" and "I/O blocking".]

I/O Bound Case

Lemma 7: In the I/O bound case, the I/O thread never blocks until all input blocks are fetched.

Our I/O thread strategy (the same):
– ≥ DB elements in write buffer => output step
– < DB elements in write buffer AND D overlap buffers avail. => input step

Proof: similar to the compute bound case.

[Figure: state diagram plotting r (ticks kB+L/ℓ, kB+DB, kB+DB+L/ℓ, kB+2DB, kB+2DB+L/ℓ, kB+3DB) against the write buffer fill (ticks 0, DB-L/ℓ, DB, 2DB-L/ℓ, 2DB), with fetch and output steps and the blocking region.]

Multi-head Model => Multi-disk Model

Randomized Cyclic Allocation [VitHut’01] keeps disk accesses balanced.
Optimal prefetching from [HutchSanVitt’01, SanEgnKor’00]

Corollary 8: For any ε > 0 and a prefetch buffer of size m = Θ(D/ε), merging k sequences with a total of N' elements can be implemented to run in time

$$\max\left(\frac{2LN'}{(1-\varepsilon)DB},\; \ell N'\right) + T_{\text{startup}}$$

[Figure: merging with overlap-disk scheduling: O(D) prefetch buffers, k+O(D) overlap buffers (together the read buffers), merge buffers, and 2D write buffers holding D blocks of merged elements.]
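
One common reading of randomized cyclic allocation is that every run draws its own random permutation of the D disks and places its consecutive blocks on the disks in that cyclic order. The sketch below implements that reading only; it is not taken from the slide or from <stxxl>, and the function name and the `disk_of[r][i]` layout are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Returns disk_of[r][i], the disk that holds block i of run r.
// blocks_per_run[r] is the number of blocks in run r; D is the number of disks.
std::vector<std::vector<std::size_t>> randomized_cyclic_allocation(
        const std::vector<std::size_t>& blocks_per_run, std::size_t D,
        std::mt19937& rng) {
    assert(D > 0);
    std::vector<std::vector<std::size_t>> disk_of;
    for (std::size_t n : blocks_per_run) {
        // each run gets its own random permutation of the disks
        std::vector<std::size_t> perm(D);
        std::iota(perm.begin(), perm.end(), 0);
        std::shuffle(perm.begin(), perm.end(), rng);
        // consecutive blocks cycle through the disks in that order
        std::vector<std::size_t> disks(n);
        for (std::size_t i = 0; i < n; ++i) disks[i] = perm[i % D];
        disk_of.push_back(std::move(disks));
    }
    return disk_of;
}
```

Within one run, any D consecutive blocks land on D distinct disks, so a single run can always be read at full parallelism; the per-run random permutation is what spreads the load when blocks of many runs are requested in an interleaved order.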

Implementation Issues

Implementation (part of the <stxxl> library):
– Forming runs: key sorting, efficient two-pass MSD radix sort
– Multi-way merging: [Sanders’00]
– prefetch buffer + overlap buffer = read buffer
– Asynchronous I/O (POSIX threads)
– No superfluous copying:
  User blocks are transferred by DMA (O_DIRECT flag)
  Buffers are passed by pointer between pools

[Figure: <stxxl> structure, with layers labeled Applications; Sorting, Priority queues, Search trees; Caching, Mem. management, Disk scheduling; POSIX threads, File system; and the side labels "STL + more" and "OS".]

Hardware

3000 Euro, 375 MByte/s, July 2002

[Figure: block diagram of the test machine: 2x Xeon (4 threads), Intel E7500 chipset, 1 GB DDR RAM, PCI busses (2x64x66 Mb/s), controllers, channels (4x2x100 MB/s), and 8x80 GB disks (8x45 MB/s); a further link is labeled 400x64 Mb/s.]

Single Disk Performance

LEDA-SM, TPIE
2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
I/O bandwidth 45.4 MByte/s = 95% of the peak bandwidth of the disk

Multiple Disk Performance

2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
TPIE, LEDA-SM
– Linux Soft-RAID, 8 x 128 KByte stripes
Bandwidth of 315 MByte/s

Element Size

16 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
Element size ≥ 64: merging is I/O bound
Element size ≥ 128: run formation is I/O bound
Small elements require special treatment

Optimal Prefetching

[Figure: two buffer layouts for merging. First: k+O(D) overlap buffers (read buffers), merge buffers, and 2D write buffers holding D blocks. Second: the same layout with O(D) prefetch buffers in front of the overlap buffers.]

Block Size

Block sizes of several MB => good performance
Leave room for read and write buffers
Naive striping implies a factor 8 block size reduction
Dramatic run time increase

Large Inputs

One-pass, curves go up because of
– Slower zones
– Seek times

Summary

Parallel disk sorting with high performance on state-of-the-art hardware, with theoretical performance guarantees
Bandwidth is no longer a limiting factor for external memory algorithms
Price-performance ratio can improve by adding disks

?