lecture 11: external sorting - github pages · external merge sort algorithm (2-way sort) disk main...

Post on 14-May-2020

19 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Lecture11:ExternalSorting

Lecture11

Announcements

1. MidtermReview:ThisFriday!

2. ProjectPart#2isout.ImplementCLOCK!

3. MidtermMaterial:EverythinguptoBuffermanagement.1. Today’slectureisnofairgame.2. Donotforgettogoovertheactivitiesaswell!

4. OurusualWednesdayupdate….

2

Lecture11

Lecture11:ExternalSorting

Lecture11

Whatyouwilllearnaboutinthissection

1. ExternalMerge(ofsortedfiles)

2. ExternalMerge- Sort

4

Lecture11

1.ExternalMerge

5

Lecture11

Challenge:MergingBigFileswithSmallMemory

Howdoweefficientlymergetwosortedfileswhenbotharemuchlargerthanourmainmemorybuffer?

Lecture11>Section1>ExternalMerge

ExternalMergeAlgorithm

• Input:2sorted listsoflengthMandN

• Output: 1sortedlistoflengthM+N

• Required:Atleast3BufferPages

• IOs:2(M+N)

Lecture11>Section1>ExternalMerge

Key(Simple)IdeaTofindanelementthatisnolargerthanallelementsintwolists,one

onlyneedstocompareminimumelementsfromeachlist.

Lecture11>Section1>ExternalMerge

If:𝐴" ≤ 𝐴$ ≤ ⋯ ≤ 𝐴&𝐵" ≤ 𝐵$ ≤ ⋯ ≤ 𝐵(

Then:𝑀𝑖𝑛(𝐴", 𝐵") ≤ 𝐴/𝑀𝑖𝑛(𝐴", 𝐵") ≤ 𝐵0

fori=1….Nandj=1….M

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

7,11 20,31

23,24 25,30

Input:Twosortedfiles

Output:Onemergedsortedfile

Disk

MainMemory

Buffer1,5

2,22

F1

F2

Lecture11>Section1>ExternalMerge

7,11 20,31

23,24 25,30

Disk

MainMemory

Buffer

1,5 2,22Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

7,11 20,31

23,24 25,30

Disk

MainMemory

Buffer

5 22 1,2Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

7,11 20,31

23,24 25,30

Disk

MainMemory

Buffer

5 22

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

20,31

23,24 25,30

Disk

MainMemory

Buffer

522

1,2

Thisisallthealgorithm“sees”…Whichfiletoloadapagefromnext?

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

7,11

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

20,31

23,24 25,30

Disk

MainMemory

Buffer

522

1,2

WeknowthatF2 onlycontainsvalues≥ 22…soweshouldloadfromF1!

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F2

7,11

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

20,31

23,24 25,30

Disk

MainMemory

Buffer

522

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F27,11

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

20,31

23,24 25,30

Disk

MainMemory

Buffer

5,722

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F211

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

20,31

23,24 25,30

Disk

MainMemory

Buffer

5,7

22

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F211

ExternalMergeAlgorithm

Lecture11>Section1>ExternalMerge

23,24 25,30

Disk

MainMemory

Buffer

5,7

22

1,2

Input:Twosortedfiles

Output:Onemergedsortedfile

F1

F211

20,31

Andsoon…

ExternalMergeAlgorithm

Wecanmerge2listsofarbitrarylengthwithonly 3bufferpages.

IflistsofsizeMandN,thenCost: 2(M+N)IOs

Eachpageisreadonce,writtenonce

WithB+1bufferpages,canmergeBlists.How?

Lecture11>Section1>ExternalMerge

2.ExternalMergeSort

20

Lecture11>Section2

Whatyouwilllearnaboutinthissection

1. Externalmergesort(2-waysort)

2. Externalmergesortonlargerfiles

3. Optimizationsforsorting

21

Lecture11>Section2

ExternalMergeAlgorithm

• Supposewewanttomergetwosorted filesbothmuchlargerthanmainmemory(i.e.thebuffer)

•Wecanusetheexternalmergealgorithm tomergefilesofarbitrarylength in2*(N+M)IOoperationswithonly3bufferpages!

Ourfirstexampleofan“IOaware”algorithm/costmodel

Lecture11>Section2>ExternalMergeSort

WhyareSortAlgorithmsImportant?

• DatarequestedfromDBinsortedorderisextremelycommon• e.g.,findstudentsinincreasing GPA order

•Whynotjustusequicksortinmainmemory??• Whataboutifweneedtosort1TBofdatawith1GBofRAM…

Aclassicproblemincomputerscience!

Lecture11>Section2>ExternalMergeSort

Morereasonstosort…

• Sortingusefulforeliminatingduplicatecopiesinacollectionofrecords(Why?)

• SortingisfirststepinbulkloadingB+treeindex.

• Sort-merge joinalgorithminvolvessorting

Comingup…

Comingup…

Lecture11>Section2>ExternalMergeSort

Dopeoplecare?

Sortbenchmarkbearshisname

http://sortbenchmark.org

Lecture11>Section2>ExternalMergeSort

ExternalMergeSort

Lecture11>Section2>ExternalMergeSort

Sohowdowesortbigfiles?

1. Splitintochunkssmallenoughtosortinmemory(“runs”)

2. Merge pairs(orgroups)ofrunsusingtheexternalmergealgorithm

3. Keepmerging theresultingruns(eachtime=a“pass”)untilleftwithonesortedfile!

Lecture11>Section2>ExternalMergeSort

ExternalMergeSortAlgorithm(2-waysort)

27,24 3,1

Example:• 3Bufferpages• 6-pagefile

Disk MainMemory

Buffer

18,22F

33,12 55,3144,10

1. Splitintochunkssmallenoughtosortinmemory

Lecture11>Section2>ExternalMergeSort

Orangefile=unsorted

ExternalMergeSortAlgorithm(2-waysort)

27,24 3,1

Disk MainMemory

Buffer

18,22

F1

F2

33,12 55,3144,10

1. Splitintochunkssmallenoughtosortinmemory

Example:• 3Bufferpages• 6-pagefile

Lecture11>Section2>ExternalMergeSort

Orangefile=unsorted

ExternalMergeSortAlgorithm(2-waysort)

27,24 3,1

Disk MainMemory

Buffer

18,22

F1

F233,12 55,3144,10

1. Splitintochunkssmallenoughtosortinmemory

Example:• 3Bufferpages• 6-pagefile

Lecture11>Section2>ExternalMergeSort

Orangefile=unsorted

ExternalMergeSortAlgorithm(2-waysort)

27,24 3,1

Disk MainMemory

Buffer

18,22

F1

F231,33 44,5510,12

Example:• 3Bufferpages• 6-pagefile

1. Splitintochunkssmallenoughtosortinmemory

Lecture11>Section2>ExternalMergeSort

Orangefile=unsorted

ExternalMergeSortAlgorithm(2-waysort)

Disk MainMemory

BufferF1

F2

31,33 44,5510,12

AndsimilarlyforF2

27,24 3,118,2218,22 24,271,3

1. Splitintochunkssmallenoughtosortinmemory

Example:• 3Bufferpages• 6-pagefile

Lecture11>Section2>ExternalMergeSort

Eachsortedfileisacalledarun

ExternalMergeSortAlgorithm(2-waysort)

Disk MainMemory

BufferF1

F2

2.Nowjustruntheexternalmerge algorithm&we’redone!

31,33 44,5510,12

18,22 24,271,3

Example:• 3Bufferpages• 6-pagefile

Lecture11>Section2>ExternalMergeSort

CalculatingIOCost

For3bufferpages,6pagefile:

1. Splitintotwo3-pagefiles andsortinmemory1. =1R+1Wforeachfile=2*(3+3)=12IOoperations

2. Merge eachpairofsortedchunksusingtheexternalmergealgorithm1. =2*(3+3)=12IOoperations

3. Totalcost=24IO

Lecture11>Section2>ExternalMergeSort

RunningExternalMergeSortonLargerFiles

Disk

31,33 44,5510,12

18,43 24,2745,38

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

31,33 47,5510,12

18,22 23,2041,3

31,33 39,5542,46

18,23 24,271,3

48,33 44,4010,12

18,22 24,2716,31

Lecture11>Section2>ExternalMergeSort:Largerfiles

RunningExternalMergeSortonLargerFiles

Disk

31,33 44,5510,12

18,43 24,2745,38

31,33 47,5510,12

18,22 23,2041,3

31,33 39,5542,46

18,23 24,271,3

48,33 44,4010,12

18,22 24,2716,31

1.Splitintofilessmallenoughtosortinbuffer…

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

Lecture11>Section2>ExternalMergeSort:Largerfiles

RunningExternalMergeSortonLargerFiles

Disk

31,33 44,5510,12

27,38 43,4518,24

31,33 47,5510,12

20,22 23,413,18

39,42 46,5531,33

18,23 24,271,3

33,40 44,4810,12

22,24 27,3116,18

1.Splitintofilessmallenoughtosortinbuffer…andsort

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

Lecture11>Section2>ExternalMergeSort:Largerfiles

Calleachofthesesortedfilesarun

RunningExternalMergeSortonLargerFiles

Disk

31,33 44,5510,12

27,38 43,4518,24

31,33 47,5510,12

20,22 23,413,18

39,42 46,5531,33

18,23 24,271,3

33,40 44,4810,12

22,24 27,3116,18

2.Nowmergepairsof(sorted)files…theresultingfileswillbesorted!

Disk

18,24 27,3110,12

43,44 45,5533,38

12,18 20,223,10

33,41 47,5523,31

18,23 24,271,3

39,42 46,5531,33

16,18 22,2410,12

33,40 44,4827,31

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

Lecture11>Section2>ExternalMergeSort:Largerfiles

RunningExternalMergeSortonLargerFiles

Disk

31,33 44,5510,12

27,38 43,4518,24

31,33 47,5510,12

20,22 23,413,18

39,42 46,5531,33

18,23 24,271,3

33,40 44,4810,12

22,24 27,3116,18

3.Andrepeat…

Disk

18,24 27,3110,12

43,44 45,5533,38

12,18 20,223,10

33,41 47,5523,31

18,23 24,271,3

39,42 46,5531,33

16,18 22,2410,12

33,40 44,4827,31

Disk

10,12 12,183,10

22,23 24,2718,20

33,33 38,4131,31

45,47 55,5543,44

10,12 16,181,3

23,24 24,2718,22

31,33 33,3927,31

44,46 48,5540,42

Assumewestillonlyhave3 bufferpages(Buffernotpictured)

Lecture11>Section2>ExternalMergeSort:Largerfiles

Calleachofthesestepsapass

RunningExternalMergeSortonLargerFiles

Disk

31,33 44,5510,12

27,38 43,4518,24

31,33 47,5510,12

20,22 23,413,18

39,42 46,5531,33

18,23 24,271,3

33,40 44,4810,12

22,24 27,3116,18

4.Andrepeat!

Disk

18,24 27,3110,12

43,44 45,5533,38

12,18 20,223,10

33,41 47,5523,31

18,23 24,271,3

39,42 46,5531,33

16,18 22,2410,12

33,40 44,4827,31

Disk

10,12 12,183,10

22,23 24,2718,20

33,33 38,4131,31

45,47 55,5543,44

10,12 16,181,3

23,24 24,2718,22

31,33 33,3927,31

44,46 48,5540,42

Disk

3,10 10,101,3

12,16 18,1812,12

20,22 22,2318,18

24,24 27,2723,24

31,31 31,3327,31

33,38 39,4033,33

43,44 44,4541,42

48,55 55,5546,47

Lecture11>Section2>ExternalMergeSort:Largerfiles

Simplified3-pageBufferVersionAssumeforsimplicitythatwesplitanN-pagefileintoNsingle-pagerunsandsortthese;then:

• Firstpass:MergeN/2pairsofrunseach oflength1page

• Secondpass:MergeN/4pairsofrunseachoflength2pages

• Ingeneral,forN pages,wedo 𝒍𝒐𝒈𝟐 𝑵 passes• +1fortheinitialsplit&sort

• Eachpassinvolvesreadingin&writingoutallthepages=2NIO

Unsortedinputfile

Split&sort

Merge

Merge

Sorted!

à 2N*( 𝒍𝒐𝒈𝟐 𝑵 +1)totalIOcost!

Lecture11>Section2>ExternalMergeSort:Largerfiles

UsingB+1bufferpagestoreduce#ofpasses

SupposewehaveB+1bufferpagesnow;wecan:

1. Increaselengthofinitialruns.SortB+1atatime!Atthebeginning,wecansplittheNpagesintorunsoflengthB+1andsorttheseinmemory

Lecture11>Section2>Optimizationsforsorting

2𝑁( log$ 𝑁 + 1)

IOCost:

Startingwithrunsoflength1

2𝑁( log$𝑵

𝑩 + 𝟏 + 1)

StartingwithrunsoflengthB+1

UsingB+1bufferpagestoreduce#ofpasses

SupposewehaveB+1bufferpagesnow;wecan:

2.PerformaB-waymerge.Oneachpass,wecanmergegroupsofBrunsatatime(vs.mergingpairsofruns)!

Lecture11>Section2>Optimizationsforsorting

IOCost:

2𝑁( log$ 𝑁 + 1) 2𝑁( log$𝑵

𝑩 + 𝟏 + 1)

Startingwithrunsoflength1

StartingwithrunsoflengthB+1

2𝑁( log@𝑵

𝑩 + 𝟏 + 1)

PerformingB-waymerges

Repacking

Lecture11>Section2>Optimizationsforsorting

Repackingforevenlongerinitialruns

• WithB+1bufferpages,wecannowstartwithB+1-lengthinitialruns(anduseB-waymerges)toget2𝑁( log@

𝑵𝑩A𝟏

+ 1) IOcost…

• Canwereducethiscostmorebygettingevenlongerinitialruns?

• Userepacking- producelongerinitialrunsby“merging”inbufferaswesortatinitialstage

Lecture11>Section2>Optimizationsforsorting

Repacking Example:3pagebuffer

• Startwithunsortedsingleinputfile,andload2pages

57,24 3,98

DiskMainMemory

Buffer18,22F1

10,33 44,5531,12

Lecture11>Section2>Optimizationsforsorting

F2

Repacking Example:3pagebuffer

• Taketheminimumtwovalues,andputinoutputpage

57,24 3,98

DiskMainMemory

Buffer18,22F1

10,33

44,55

31,12

Lecture11>Section2>Optimizationsforsorting

F2 31 33 10,12

m=12

Alsokeeptrackofmax(last)valueincurrentrun…

Repacking Example:3pagebuffer

• Next,repack

57,24 3,98

DiskMainMemory

BufferF1

33

Lecture11>Section2>Optimizationsforsorting

F2 31 31,3310,12

m=1244,55

18,22

Repacking Example:3pagebuffer

• Next,repack,thenloadanotherpageandcontinue!

57,24 3,98

DiskMainMemory

BufferF1

Lecture11>Section2>Optimizationsforsorting

F2 31,3310,12

m=1244,55

m=33

18,22

Repacking Example:3pagebuffer

• Now,however,thesmallestvaluesarelessthanthelargest(last)inthesortedrun…

3,98

DiskMainMemory

BufferF1

Lecture11>Section2>Optimizationsforsorting

F2 31,3310,12

m=33

18,2218,22

Wecallthesevaluesfrozen becausewecan’taddthemtothisrun…

44,55

57,24

Repacking Example:3pagebuffer

• Now,however,thesmallestvaluesarelessthanthelargest(last)inthesortedrun…

DiskMainMemory

BufferF1

Lecture11>Section2>Optimizationsforsorting

F2 31,3310,12

m=55

44,55 57,24 18,22

3,98

Repacking Example:3pagebuffer

• Now,however,thesmallestvaluesarelessthanthelargest(last)inthesortedrun…

DiskMainMemory

BufferF1

Lecture11>Section2>Optimizationsforsorting

F2 31,3310,12

m=55

44,55 57,24 18,22 3,98

Repacking Example:3pagebuffer

• Now,however,thesmallestvaluesarelessthanthelargest(last)inthesortedrun…

DiskMainMemory

BufferF1

Lecture11>Section2>Optimizationsforsorting

F2 31,3310,12

m=55

44,55 3,24 18,22 57,98

Repacking Example:3pagebuffer• Onceallbufferpageshaveafrozenvalue,orinputfileisempty,startnewrunwiththefrozenvalues

DiskMainMemory

BufferF1

Lecture11>Section2>Optimizationsforsorting

F2 31,3310,12

m=0

44,55 3,24 18,22

57,98

F3

Repacking Example:3pagebuffer• Onceallbufferpageshaveafrozenvalue,orinputfileisempty,startnewrunwiththefrozenvalues

DiskMainMemory

BufferF1

Lecture11>Section2>Optimizationsforsorting

F2 31,3310,12

m=0

44,55

57,98

F3

3,18 22,24

Repacking

• Notethat,forbufferwithB+1pages:• Ifinputfileissortedà nothingisfrozenà wegetasingle run!• Ifinputfileisreversesorted(worstcase)à everythingisfrozenà wegetrunsoflengthB+1

• Ingeneral,withrepackingwedonoworse thanwithoutit!

• Whatifthefileisalreadysorted?

• Engineer’sapproximation:runswillhave~2(B+1)length

~2𝑁( log@𝑵

𝟐(𝑩 + 𝟏) + 1)

Lecture11>Section2>Optimizationsforsorting

Summary

• BasicsofIOandbuffermanagement.

• WeintroducedtheIOcostmodelusingsorting.• SawhowtodomergeswithfewIOs,• Worksbetterthanmain-memorysortalgorithms.

• Describedafewoptimizationsforsorting

Lecture11

top related