8. external sorting


Upload: jayme

Post on 14-Jan-2016


Page 1: 8.  External Sorting

8. External Sorting

Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer.

What shall we do?

Need to use EXTERNAL STORAGE DEVICE !!!

External Sorting

- Disk Sort

- Tape Sort

What is the major difference between the two external sorts?

Page 2: 8.  External Sorting

Sorting with Disk

k-way merging ("mergesort")

[Figure: an internal sort produces sorted runs, which are then merged.]

Page 3: 8.  External Sorting

Example

4500 records

250 records/block

available memory = 3 blocks

Def’n : A segment of a file is said to be a run if all the records in the segment are sorted.

Input file I (4500 records = 6 runs of 3 blocks each):  1 2 3 4 5 6

D1:  1 3 5 ……

D2:  2 4 6 ……
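The run-creation step of this example can be sketched as follows. This is only an illustration under stated assumptions: records are plain integers in a Python list, the whole memory (3 blocks = 750 records) is used for the internal sort, and the names `make_runs`, `d1` and `d2` are hypothetical, not from the slides.

```python
import random

def make_runs(records, memory_records):
    """Cut the input into chunks that fit in memory and sort each chunk."""
    return [sorted(records[i:i + memory_records])
            for i in range(0, len(records), memory_records)]

random.seed(0)
records = [random.randrange(10**6) for _ in range(4500)]

runs = make_runs(records, 3 * 250)    # memory = 3 blocks of 250 records
d1, d2 = runs[0::2], runs[1::2]       # runs 1, 3, 5 go to D1; runs 2, 4, 6 to D2
print(len(runs), len(d1), len(d2))
```

Each element of `runs` is a run in the sense of the definition above, since all of its records are sorted.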

Page 4: 8.  External Sorting

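[Figure: runs of size 3 on D1 and D2 are merged into runs of size 6 on D3 and D4, and so on, up to size n/2 and finally n. The number shown at each step is the size of a run.]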

Page 5: 8.  External Sorting

Run size    Runs
3           1 3 5 7 | 2 4 6 8
6           12 34 56 78
12          1256 3478
24          12345678

How many passes?

$1 + \log_2 r$   (r: # of initial runs)
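A minimal sketch of the merge passes counted here, assuming in-memory lists stand in for the disk files and using the standard-library `heapq.merge` for each 2-way merge. The function name `two_way_merge_passes` and the data are illustrative only.

```python
import heapq
import random

def two_way_merge_passes(runs):
    """Merge runs pairwise until one run remains; return (result, merge passes)."""
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2]))
                for i in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes

random.seed(0)
data = random.sample(range(100), 24)
initial_runs = [sorted(data[i:i + 3]) for i in range(0, 24, 3)]  # r = 8 runs of size 3

merged, passes = two_way_merge_passes(initial_runs)
print(passes + 1)   # merge passes plus the one internal-sort pass
```

With r = 8 the loop runs log2(8) = 3 times, and the internal sort that created the runs accounts for the extra 1.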

Page 6: 8.  External Sorting

# of I/O operations $= O(n \log_2 r) = O\left(n \log_2 \frac{n}{a}\right)$, where $r = \frac{n}{a}$ and $a$ is the initial run size.

Page 7: 8.  External Sorting

k-way merging

[Figure: the r initial runs are merged k at a time; the merge tree has height $\log_k r$.]

# of passes

$1 + \log_k r$

# of I/O operations?

$O(n \log_k r)$

better than 2-way merging !!!
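The effect of k on the number of passes can be checked with a few lines. This is a sketch, not part of the slides; `merge_passes` is a hypothetical helper that simply counts how often r runs can be reduced k at a time.

```python
def merge_passes(r, k):
    """Number of k-way merge passes needed to reduce r runs to one."""
    passes = 0
    while r > 1:
        r = -(-r // k)    # ceil(r / k) runs remain after one pass
        passes += 1
    return passes

for k in (2, 4, 8):
    # total passes = 1 (internal sort) + merge passes
    print(k, 1 + merge_passes(64, k))
```

Since every pass reads and writes all n records, fewer passes means proportionally fewer I/O operations.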

Page 8: 8.  External Sorting

How about # of comparisons?

Is k-way merging always better than 2-way merging?

Page 9: 8.  External Sorting

Replacement Selection

[Figure: k-way merge tree over the initial runs.]

# of passes: #(P) $= 1 + \log_k r$

#(P) decreases as k increases or as r decreases; r decreases as the run size increases.

Page 10: 8.  External Sorting

# of comparisons(k-way merge)

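[Figure: selection tree for an 8-way merge. Nodes are numbered 1 to 15, with the leaves at nodes 8 to 15.]

                 Run 1  Run 2  Run 3  Run 4  Run 5  Run 6  Run 7  Run 8
third record       16     38     30     25     50     16    110     20
second record      15     20     20     25     15     11    120     18
first record       10      9     20     15      8      9     90     17

Leaves (nodes 8 to 15):   10  9  20  15  8  9  90  17
Nodes 4 to 7:              9  15  8  17
Nodes 2 to 3:              9  8
Root (node 1):             8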

Page 11: 8.  External Sorting

How many comparisons in a pass?

$n \log_2 k$   why?

Total # of comparisons?

(# of passes) × (# of comparisons in a pass)

$= (\log_k r)(n \log_2 k)$

$= n \log_2 r$   independent of k !!!

#(c) decreases as r decreases.
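The step from $(\log_k r)(n \log_2 k)$ to $n \log_2 r$ is just a change of base. Written out, with the same symbols n, k and r as above:

```latex
\begin{align*}
\log_k r &= \frac{\log_2 r}{\log_2 k} \\
(\log_k r)\,(n \log_2 k) &= \frac{\log_2 r}{\log_2 k} \cdot n \log_2 k = n \log_2 r
\end{align*}
```

The $n \log_2 k$ term comes from selecting each of the n records through a selection tree of height about $\log_2 k$.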

Page 12: 8.  External Sorting

How to increase the (initial) run size

x1, x2, x3, …, xm | xm+1, xm+2, xm+3, …, x2m | x2m+1, x2m+2, x2m+3, …

m keys | m keys | m keys | …...

r = # of runs $= \frac{n}{m}$

Any improvement?

Observation

See p.94 in textbook !!!

Page 13: 8.  External Sorting

4,2,32,12,18,24,91,11

(record size >> the size of pointer)

why do we need this?

[Figure: selection tree over the keys above; the internal nodes are numbered 2 to 7.]

Page 14: 8.  External Sorting

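A tree of losers

Keys: 4, 2, 32, 12, 18, 24, 91, 11

Each node has the fields parent and loser.

Updating pointers

    ptr := winner.parent;
    while ptr ≠ nil do
        if (ptr.loser.key < winner.key) then
            interchange(ptr.loser, winner);
        end {if}
        ptr := ptr.parent;
    end {while}

[Figure: tree of losers for the keys above; the surviving labels are 11, 91, winner, 18, 24.]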
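The pointer update above can be tried out in an array-based form. The sketch below is an assumption-laden illustration, not the textbook's code: runs are lists of numbers, `tree[0]` holds the current winner and `tree[1:]` the loser stored at each internal node, `float("inf")` marks an exhausted run, and a `-inf` sentinel is used only while the tree is built. The names `loser_tree_merge` and `replay` are hypothetical.

```python
def loser_tree_merge(runs):
    """Merge sorted runs of numbers with a tree of losers."""
    k = len(runs)
    if k == 0:
        return []
    INF, NEG = float("inf"), float("-inf")
    iters = [iter(r) for r in runs]
    keys = [next(it, INF) for it in iters] + [NEG]  # keys[k]: build-time sentinel
    tree = [k] * k   # tree[0] = winner, tree[1:] = losers at internal nodes

    def replay(winner):
        node = (winner + k) // 2              # parent of the leaf for run `winner`
        while node > 0:
            if keys[tree[node]] < keys[winner]:   # stored loser beats the climber
                tree[node], winner = winner, tree[node]
            node //= 2                        # ptr := ptr.parent
        tree[0] = winner

    for i in reversed(range(k)):              # build the tree leaf by leaf
        replay(i)

    out = []
    while keys[tree[0]] != INF:
        w = tree[0]
        out.append(keys[w])
        keys[w] = next(iters[w], INF)         # read the next record of that run
        replay(w)
    return out

print(loser_tree_merge([[4], [2], [32], [12], [18], [24], [91], [11]]))
```

Only the path from the changed leaf to the root is revisited after each output, which is why the loser tree needs about log2 k comparisons per record.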

Page 15: 8.  External Sorting

Explain p.97-101, textbook !!!

Exercise :

In a complete 2-tree(T) with n leaf nodes,

show that

total # of nodes in T = 2n -1

Page 16: 8.  External Sorting

Performance Analysis

(Average size of runs)

m0: # of records in (real) memory.

H. Seward (M.S. Thesis, MIT, 1954)

gave a good reason to believe that a run contains more than 1.5 m0 records

(no proof)

E. Friend (JACM, 3, (1956))

experiment: run length of about 2 m0

E. Moore (1961)

Proved that 2 m0 is the expected run length.
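The 2 m0 figure is easy to observe experimentally. The following is a small simulation sketch, assuming uniformly random keys and using a binary heap in place of the tree of losers; `replacement_selection` is an illustrative name, and records tagged for the next run are kept in the same heap via a (run number, key) pair.

```python
import heapq
import random
from itertools import islice

def replacement_selection(records, m0):
    """Split `records` into sorted runs using a heap of at most m0 records."""
    it = iter(records)
    # Heap entries are (run number, key): records held for a later run sort last.
    heap = [(0, x) for x in islice(it, m0)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run, key = heapq.heappop(heap)
        if run == len(runs):              # first record of a new run
            runs.append([])
        runs[run].append(key)
        for nxt in islice(it, 1):         # read one replacement record, if any
            # A key smaller than the one just written must wait for the next run.
            heapq.heappush(heap, (run if nxt >= key else run + 1, nxt))
    return runs

random.seed(1)
data = [random.random() for _ in range(100_000)]
runs = replacement_selection(data, 100)
print(len(data) / len(runs))   # average run length, to compare with 2 * m0 = 200
```

On already sorted input the same function produces a single run, and on reverse-sorted input it degrades to runs of exactly m0 records.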

Page 17: 8.  External Sorting

Sketch of Moore’s Proof

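Snowplow

[Figure: a snowplow circling a track under uniformly falling snow. The snow on the track at any moment corresponds to m0; with a uniform distribution, the snow removed in one circuit is 2m0.]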

Page 18: 8.  External Sorting

Tape Sorting

• Balanced k-way merging

(similar to disk sorting)

• Polyphase merging

• Cascade merging

Page 19: 8.  External Sorting

Polyphase Merging (Motivation)

– (R1, R2, …, R5000)
– length(Ri) = 20 bytes
– Only 1000 records fit in the internal memory at one time (about 20k bytes)
– 4 tapes available

Balanced 2-way merge

Step      T1                               T2                       T3                    T4
initial   R1,1000 R2001,3000 R4001,5000    R1001,2000 R3001,4000
pass 1                                                              R1,2000 R4001,5000    R2001,4000
pass 2    R1,4000                          R4001,5000
pass 3                                                              R1,5000

Total # of operations = 15000

Page 20: 8.  External Sorting

Step      Tape 1                 Tape 2                   Tape 3        Tape 4
initial   R1,1000 R3001,4000     R1001,2000 R4001,5000    R2001,3000
merge 1   R3001,4000             R4001,5000               (rewind)      R1,3000
merge 2                                                   R1,5000

• Total # of I/O operations

3000 + 5000 = 8000

Balanced Merge is not always best !!!

Page 21: 8.  External Sorting

What if only 3 tapes available?

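Step                    Tape contents (tapes separated by |)                      # of I/O
initial distribution    R1,1000 R2001,3000 R4001,5000 | R1001,2000 R3001,4000
merge                   R1,2000 R2001,4000 R4001,5000                              5000
copy R1,2000            R1,2000 | R2001,4000 R4001,5000                            2000
merge                   R1,4000 R4001,5000                                         5000
copy R1,4000            R4001,5000 | R1,4000                                       4000
merge                   R1,5000                                                    5000

Total # of I/O Operations

5000 + 2000 + 5000 + 4000 + 5000 = 21,000 !!!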

Page 22: 8.  External Sorting

Step      Tape 1                           Tape 2                   Tape 3                # of I/O
initial   R1,1000 R2001,3000 R4001,5000    R1001,2000 R3001,4000
merge 1   R4001,5000                       (rewind)                 R1,2000 R2001,4000    4000
merge 2   (rewind)                         R1,2000; 4001,5000       R2001,4000            3000
merge 3   R1,5000                                                                         5000

Total # of I/O Operations

4000 + 3000 + 5000 = 12,000 !!!

Page 23: 8.  External Sorting

Polyphase merge

T1      T2      T3      T4      T5      T6
1^31    1^30    1^28    1^24    1^16    -
1^15    1^14    1^12    1^8     -       5^16
1^7     1^6     1^4     -       9^8     5^8
1^3     1^2     -       17^4    9^4     5^4
1^1     -       33^2    17^2    9^2     5^2
-       65^1    33^1    17^1    9^1     5^1
129^1   -       -       -       -       -

(a^b denotes b runs of length a.)

How to assign initial runs?

Page 24: 8.  External Sorting

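Cascade Merge

           T1      T2      T3      T4      T5      T6
initial    1^55    1^50    1^41    1^29    1^15    -
Pass 1     1^40    1^35    1^26    1^14    -       5^15
           1^26    1^21    1^12    -       4^14    5^15
           1^14    1^9     -       3^12    4^14    5^15
           1^5     -       2^9     3^12    4^14    5^15
          (-       1^5     2^9     3^12    4^14    5^15)
Pass 2     15^5    -       2^4     3^7     4^9     5^10
           15^5    14^4    -       3^3     4^5     5^6
           15^5    14^4    12^3    -       4^2     5^3
           15^5    14^4    12^3    9^2     -       5^1
          (15^5    14^4    12^3    9^2     5^1     -)
Pass 3     15^4    14^3    12^2    9^1     -       55^1
           15^3    14^2    12^1    -       50^1    55^1
           15^2    14^1    -       41^1    50^1    55^1
           15^1    -       29^1    41^1    50^1    55^1
          (-       15^1    29^1    41^1    50^1    55^1)
Pass 4     190^1

(a^b denotes b runs of length a; a row in parentheses is the state after the final copy step of a pass.)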

Page 25: 8.  External Sorting

Polyphase Merge

Phase   T1      T2      T3      T4      T5      T6
1       1^31    1^30    1^28    1^24    1^16    -
2       1^15    1^14    1^12    1^8     -       5^16
3       1^7     1^6     1^4     -       9^8     5^8
4       1^3     1^2     -       17^4    9^4     5^4
5       1^1     -       33^2    17^2    9^2     5^2
6       -       65^1    33^1    17^1    9^1     5^1
7       129^1   -       -       -       -       -

Gilstad (1960)

{{1,0,0,0,0},{1,1,1,1,1},{2,2,2,2,1},{4,4,4,3,2},{8,8,7,6,4},

{16,15,14,12,8},{31,30,28,24,16}}

Perfect Fibonacci Distribution !!!

What is the underlying rule?

Page 26: 8.  External Sorting

i ai bi ci di ei

0 1 0 0 0 0

1 1 1 1 1 1

2 2 2 2 2 1

3 4 4 4 3 2

4 8 8 7 6 4

5 16 15 14 12 8

6 31 30 28 24 16

Page 27: 8.  External Sorting

(a0 + b0) (a0 + c0) (a0 + d0) (a0 + e0) a0

(a1 + b1) (a1 + c1) (a1 + d1) (a1 + e1) a1

(a2 + b2) (a2 + c2) (a2 + d2) (a2 + e2) a2

n an bn cn dn en

n+1 an + bn an + cn an + dn an + en an

an bn cn dn en
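The rule in this table can be run directly to regenerate the perfect distributions listed earlier. A minimal sketch, where `next_level` is a hypothetical helper name and a distribution is a list (a, b, c, d, e):

```python
def next_level(dist):
    """Perfect polyphase distribution at level n+1 from the one at level n."""
    a = dist[0]
    return [a + x for x in dist[1:]] + [a]   # (a+b, a+c, a+d, a+e, a)

levels = [[1, 0, 0, 0, 0]]
for _ in range(6):
    levels.append(next_level(levels[-1]))

for i, dist in enumerate(levels):
    print(i, dist, sum(dist))
```

The totals 1, 5, 9, 17, 33, 65, 129 are the numbers of initial runs for which a perfect distribution on 5 input tapes exists.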

Page 28: 8.  External Sorting

i    ai   bi   ci   di   ei   output
0     1    0    0    0    0   T6
1     1    1    1    1    1   T1
2     2    2    2    2    1   T2
3     4    4    4    3    2   T3
      2    2    2    1    0    2
      1    1    1    0    1    1
4     8    8    7    6    4   T4
5    16   15   14   12    8   T5
6    31   30   28   24   16   T6
7    61   59   55   47   31

     T1   T2   T3   T4   T5

Page 29: 8.  External Sorting

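Level n-1:  $a_{n-1},\ b_{n-1},\ c_{n-1},\ d_{n-1},\ e_{n-1}$

Level n:  $a_{n-1}+b_{n-1},\ a_{n-1}+c_{n-1},\ a_{n-1}+d_{n-1},\ a_{n-1}+e_{n-1},\ a_{n-1}$  $(= a_n,\ b_n,\ c_n,\ d_n,\ e_n)$

$e_n = a_{n-1}$

$d_n = a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2}$

$c_n = a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3}$

………….

$e_n = a_{n-1}$

$d_n = a_{n-1} + a_{n-2}$

$c_n = a_{n-1} + a_{n-2} + a_{n-3}$

$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$

$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$

($a_0 = 1$, $a_i = 0$ for $i = -1, -2, -3, -4$)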

Page 30: 8.  External Sorting

$e_n = a_{n-1}$

$d_n = a_{n-1} + a_{n-2}$

$c_n = a_{n-1} + a_{n-2} + a_{n-3}$

$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$

$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$

Page 31: 8.  External Sorting

i     -4   -3   -2   -1    0    1    2    3    4    5    6    7
ai     0    0    0    0    1    1    2    4    8   16   31   61
bi                         0
ci                         0
di                         0
ei                         0

Page 32: 8.  External Sorting

i      1    2    3    4    5    6    7
ai     1    2    4    8   16   31   61
bi     1    2    4    8   15   30   59
ci     1    2    4    7   14   28   55
di     1    2    3    6   12   24   47
ei     1    1    2    4    8   16   31

Page 33: 8.  External Sorting

$a_i = \langle 0, 0, 0, 0, 1, 1, 2, 4, 8, 16, 31, 61, \ldots \rangle$, $i = -4, -3, -2, -1, 0, 1, 2, \ldots$

"The kth order Fibonacci number"

$F_n^{(k)} = F_{n-1}^{(k)} + F_{n-2}^{(k)} + \cdots + F_{n-k}^{(k)}$

$F_n^{(k)} = 0$ for $0 \le n \le k-2$;   $F_n^{(k)} = 1$ for $n = k-1$

e.g.) The second order Fibonacci number

0 1 1 2 3 5 ……

$F_n^{(2)} = F_{n-1}^{(2)} + F_{n-2}^{(2)}$

$F_n^{(2)} = 0$ if $n = 0$;   $F_n^{(2)} = 1$ if $n = 1$

Fibonacci number !!!

$a_n = F_{n+k-1}^{(k)}$ if k tapes (input) are used

why?
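A quick way to convince oneself of the relation between the a_n sequence and the kth order Fibonacci numbers is to compute both. The sketch below assumes the usual definition (the first k-1 values are 0, the next is 1, and each later value is the sum of the previous k); `fib_k` is an illustrative name.

```python
def fib_k(n, k):
    """k-th order Fibonacci number: F_0..F_{k-2} = 0, F_{k-1} = 1."""
    f = [0] * (k - 1) + [1]
    while len(f) <= n:
        f.append(sum(f[-k:]))     # sum of the previous k values
    return f[n]

print([fib_k(n, 2) for n in range(7)])           # ordinary Fibonacci numbers
print([fib_k(n + 5 - 1, 5) for n in range(8)])   # a_n for k = 5 input tapes
```

The second line reproduces the a_i column of the distribution table, which is the relation a_n = F_{n+k-1} of order k for k = 5.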

Page 34: 8.  External Sorting

What if not perfect Fib. Dist’n?

Use dummy runs !!!

5 input tapes and 53 initial runs.

Level   T1   T2   T3   T4   T5   Total      Added per tape (T1 to T5)
1        1    1    1    1    1     5
2        2    2    2    2    1     9         1  1  1  1  0
3        4    4    4    3    2    17         2  2  2  1  1
4        8    8    7    6    4    33         4  4  3  3  2
5       16   15   14   12    8    65 > 53   (8  7  7  6  4)

Runs 34 to 53 are placed as follows:

T1     T2     T3     T4     T5
(34)
(35)   (36)   (37)
(38)   (39)   (40)   (41)
(42)   (43)   (44)   (45)
(46)   (47)   (48)   (49)   (50)
(51)   (52)   (53)
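The number of dummy runs follows from the level table: pick the first perfect level whose total is at least the number of real runs and pad the difference. A minimal sketch, with `perfect_level` as a hypothetical helper and levels numbered as in the table above (level 1 is one run per tape):

```python
def perfect_level(runs, tapes):
    """Smallest perfect level (>= 1) holding at least `runs` runs.

    Returns (level, distribution, number of dummy runs).
    """
    dist, level = [1] * tapes, 1
    while sum(dist) < runs:
        a = dist[0]
        dist = [a + x for x in dist[1:]] + [a]
        level += 1
    return level, dist, sum(dist) - runs

print(perfect_level(53, 5))
```

For 53 runs on 5 input tapes this reports level 5 with 65 - 53 = 12 dummy runs; how those dummies are spread over the tapes is a separate decision that this sketch does not model.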

Page 35: 8.  External Sorting

T1     T2     T3     T4     T5     T6
(2)    (2)    (2)    (3)    (3)            dummy runs per tape

[Figure: tape contents for the 53 real runs plus 12 dummy runs, before and after the first merge phase.]

not best

but simple and good !!!

For better one, see Knuth !!!

Page 36: 8.  External Sorting

Example (3 tapes)

T1        T2         T3
(k)^8     (k)^5      -
(k)^3     -          (2k)^5
-         (3k)^3     (2k)^2
(5k)^2    (3k)^1     -
(5k)^1    -          (8k)^1
-         (13k)^1    -

Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8

Runs on two input tapes (k)

# of runs    run size (k)    # of pairs    # of I/O's
8,5          1,1             5             10
5,3          2,1             3             9
3,2          3,2             2             10
2,1          5,3             1             8
1,1          8,5             1             13
1            13

How many passes over the data?
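The I/O column of this example can be reproduced by simulating the merge on run lengths only. The sketch below assumes a perfect Fibonacci distribution (it raises an error otherwise), counts one I/O per unit of run length written, and uses the hypothetical name `polyphase_io`; tapes are lists of run lengths in units of k.

```python
def polyphase_io(tapes):
    """Simulate polyphase merging on run lengths; return units moved per phase.

    `tapes` is a list of lists of run lengths; exactly one tape must be empty.
    """
    tapes = [list(t) for t in tapes]
    out = next(i for i, t in enumerate(tapes) if not t)
    io_per_phase = []
    while sum(len(t) for t in tapes) > 1:
        inputs = [i for i in range(len(tapes)) if i != out and tapes[i]]
        if len(inputs) < 2:
            raise ValueError("not a perfect distribution")
        merges = min(len(tapes[i]) for i in inputs)
        moved = 0
        for _ in range(merges):
            run = sum(tapes[i].pop(0) for i in inputs)   # merge one run per tape
            tapes[out].append(run)
            moved += run
        io_per_phase.append(moved)
        out = next(i for i in inputs if not tapes[i])    # emptied tape is next output
    return io_per_phase

io = polyphase_io([[1] * 8, [1] * 5, []])
print(io, sum(io), sum(io) / 13)
```

For the 8 + 5 run example the phases move 10, 9, 10, 8 and 13 units, 50 in total, so the data is passed over 50/13 (about 3.85) times during merging.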

Page 37: 8.  External Sorting

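Total number of initial runs = $F_s$ for some s, where $F_s$ is the sth Fibonacci number.

$F_s$ is split into $F_{s-1}$ and $F_{s-2}$:

T1           T2           T3
$F_{s-1}$    $F_{s-2}$    -
$F_{s-3}$    -            $F_{s-2}$
-            $F_{s-3}$    $F_{s-4}$
…………

See Fig. p.107, textbook !!!

Total # of I/O operations $= k \sum_{i=1}^{s-2} F_i F_{s-i+1}$

# of passes $= \dfrac{k \sum_{i=1}^{s-2} F_i F_{s-i+1}}{k F_s} = \dfrac{\sum_{i=1}^{s-2} F_i F_{s-i+1}}{F_s}$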

Page 38: 8.  External Sorting

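Lemma :

$\sum_{i=1}^{s-2} F_i F_{s-i+1} = \frac{s-5}{5} F_{s+1} + \frac{2s+2}{5} F_s, \quad s \ge 2$

[proof] (By induction on s)

(s=2) LHS = $\sum_{i=1}^{0} F_i F_{3-i} = 0$

RHS = $\frac{2-5}{5} F_3 + \frac{2 \cdot 2 + 2}{5} F_2 = -\frac{6}{5} + \frac{6}{5} = 0$

(s=3) LHS = $\sum_{i=1}^{1} F_i F_{4-i} = F_1 F_3 = 2$

RHS = $\frac{3-5}{5} F_4 + \frac{2 \cdot 3 + 2}{5} F_3 = -\frac{6}{5} + \frac{16}{5} = 2$

(s=k) Suppose that, for all $k' \le k$,

$\sum_{i=1}^{k'-2} F_i F_{k'-i+1} = \frac{k'-5}{5} F_{k'+1} + \frac{2k'+2}{5} F_{k'}$

(s=k+1)

Exercise !!!

See page 106-107 in textbook !!!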
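Before attempting the induction step, the identity can be checked numerically. The sketch assumes the lemma reads sum_{i=1}^{s-2} F_i F_{s-i+1} = ((s-5)/5) F_{s+1} + ((2s+2)/5) F_s with F_0 = 0 and F_1 = 1, and multiplies both sides by 5 to stay in integers; the helper names are illustrative.

```python
def fib(n):
    """Ordinary Fibonacci number with fib(0) = 0, fib(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def lhs(s):
    """Sum of F_i * F_(s-i+1) for i = 1 .. s-2."""
    return sum(fib(i) * fib(s - i + 1) for i in range(1, s - 1))

def rhs_times_5(s):
    """Five times the right-hand side, kept as an integer."""
    return (s - 5) * fib(s + 1) + (2 * s + 2) * fib(s)

print(all(5 * lhs(s) == rhs_times_5(s) for s in range(2, 40)))
print(lhs(7))   # total I/O (in units of k) when F_7 = 13 initial runs are merged
```

A numerical check is not a proof, but it guards against sign or index slips when writing the induction step.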

Page 39: 8.  External Sorting

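From the previous lemma,

# of passes $= \dfrac{\sum_{i=1}^{s-2} F_i F_{s-i+1}}{F_s} = \dfrac{\frac{s-5}{5} F_{s+1} + \frac{2s+2}{5} F_s}{F_s} = \dfrac{s-5}{5} \cdot \dfrac{F_{s+1}}{F_s} + \dfrac{2s+2}{5}$

$F_s = r$

$F_k = \dfrac{1}{\sqrt{5}} \left( \dfrac{1+\sqrt{5}}{2} \right)^k - \dfrac{1}{\sqrt{5}} \left( \dfrac{1-\sqrt{5}}{2} \right)^k \approx \dfrac{1}{\sqrt{5}} \left( \dfrac{1+\sqrt{5}}{2} \right)^k$   (1)

why?

$\dfrac{F_{j+1}}{F_j} \approx \dfrac{13}{8}$ for $j \ge 5$. Golden Ratio !!!

From (1),

$s \approx \dfrac{\log_2 F_s + \log_2 \sqrt{5}}{\log_2 (1+\sqrt{5}) - 1} \approx 1.67 + 1.43 \log_2 F_s$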

Page 40: 8.  External Sorting

Theorem:

Polyphase merge with 3 tapes, where the two input tapes hold $F_{s-1}$ and $F_{s-2}$ runs:

$F_s = r$ = # of initial runs

# of passes $\approx 1.04 \log_2 r$
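The 1.04 log2 r estimate can be compared with the exact pass count for a few values of s. This is a sketch under the assumption that the number of passes is (sum_{i=1}^{s-2} F_i F_{s-i+1}) / F_s, as on the preceding slides; `polyphase_passes` is an illustrative name.

```python
import math

def fib(n):
    """Ordinary Fibonacci number with fib(0) = 0, fib(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def polyphase_passes(s):
    """Passes over the data for a 3-tape polyphase merge of F_s initial runs."""
    total = sum(fib(i) * fib(s - i + 1) for i in range(1, s - 1))
    return total / fib(s)

for s in (10, 20, 30):
    r = fib(s)
    print(s, r, round(polyphase_passes(s), 3), round(1.04 * math.log2(r), 3))
```

The two columns agree to within a few hundredths over this range, which is what the approximation in the theorem claims.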

Page 41: 8.  External Sorting

APPROXIMATED BEHAVIOR OF POLYPHASE MERGE SORTING

Tapes   Phases               Passes               Pass/phase (percent)   Growth ratio
3       2.078 lnS + 0.672    1.504 lnS + 0.992    72                     1.6180340
4       1.641 lnS + 0.364    1.015 lnS + 0.965    62                     1.8392868
5       1.524 lnS + 0.078    0.863 lnS + 0.921    57                     1.9275620
6       1.479 lnS + 0.185    0.795 lnS + 0.864    54                     1.9659482
7       1.460 lnS + 0.424    0.762 lnS + 0.797    52                     1.9835828
8       1.451 lnS + 0.642    0.744 lnS + 0.723    51                     1.9919642
9       1.447 lnS + 0.838    0.734 lnS + 0.646    51                     1.9960312
10      1.445 lnS + 1.017    0.728 lnS + 0.568    50                     1.9980295
20      1.443 lnS + 2.170    0.721 lnS – 0.030    50                     1.9999981

APPROXIMATED BEHAVIOR OF CASCADE MERGE SORTING

Tapes   Phases               Passes               Growth ratio
3       2.078 lnS + 0.672    1.504 lnS + 0.992    1.6180340
4       1.235 lnS + 0.754    1.012 lnS + 0.820    2.2469796
5       0.946 lnS + 0.796    0.897 lnS + 0.800    2.8793852
6       0.796 lnS + 0.821    0.773 lnS + 0.808    3.5133371
7       0.703 lnS + 0.839    0.691 lnS + 0.822    4.1481149
8       0.639 lnS + 0.852    0.632 lnS + 0.834    4.7833861
9       0.592 lnS + 0.861    0.587 lnS + 0.845    5.4189757
10      0.555 lnS + 0.869    0.552 lnS + 0.854    6.0547828
20      0.397 lnS + 0.905    0.397 lnS + 0.901    12.4174426

Page 42: 8.  External Sorting

Cascade Merge

Level ai bi ci di ei

0 1 0 0 0 0

1 1 1 1 1 1

2 5 4 3 2 1

3 15 14 12 9 5

4 55 50 41 29 15

n      an                        bn           cn           dn           en
n+1    an + bn + cn + dn + en    an+1 - en    bn+1 - dn    cn+1 - cn    dn+1 - bn
       (= an+1)                                                         (= an)

Perfect dist’n

for detail see Knuth Vol III !!!
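The cascade levels in the table can be regenerated from the recurrence. The sketch assumes the rule reads a_{n+1} = a_n + b_n + c_n + d_n + e_n, b_{n+1} = a_{n+1} - e_n, c_{n+1} = b_{n+1} - d_n, d_{n+1} = c_{n+1} - c_n, e_{n+1} = d_{n+1} - b_n, which matches every row listed; `next_cascade_level` is a hypothetical name.

```python
def next_cascade_level(dist):
    """Perfect cascade distribution at level n+1 from the one at level n."""
    new = [sum(dist)]                    # a_{n+1} = a_n + b_n + c_n + d_n + e_n
    for x in reversed(dist[1:]):         # subtract e_n, d_n, c_n, b_n in turn
        new.append(new[-1] - x)
    return new

levels = [[1, 0, 0, 0, 0]]
for _ in range(4):
    levels.append(next_cascade_level(levels[-1]))

for i, dist in enumerate(levels):
    print(i, dist, sum(dist))
```

Level 4 sums to 190, the number of initial runs used in the cascade merge example.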