sorting. csm1220 - sorting 2 introduction the objective is to take an unordered set of comparable...

39
Sorting

Upload: ambrose-walsh

Post on 31-Dec-2015

219 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

Sorting

Page 2: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 2

Introduction

• The objective is to take an unordered set of comparable data items and arrange them in order.

• We will usually sort the data into ascending order — sorting into descending order is very similar.

• Data can be sorted in various ADTs, such as arrays and trees; in fact, we have already seen how a binary search tree can be used to sort data.

• There are two main sorting themes: address-based sorting and comparison-based sorting. We will focus on the latter.

Page 3: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 3

The family of sorting methods

Main sorting themes

Comparison-basedsorting

Transpositionsorting

BubbleSort

Insert andkeep sorted

Insertionsort

Treesort

Divide andconquer

QuickSort

MergeSort

ProxmapSort

RadixSort

ShellSort

Diminishingincrement

sorting

Address--basedsorting

Selectionsort

Heapsort

Priority queuesorting

Page 4: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 4

Lecture schedule

• An overview of sorting algorithms

• We will not study address-based sorting in detail because they are quite restricted in their application– One slide on Proxmap/Radix sort

• Examples from each type of comparison-based sorting algorithm– Bubble sort [transposition sorting]

– Insertion sort (already seen, see Tree notes) [insert and keep sorted]

– Selection sort [priority queue sorting]

– Shell sort [diminishing increment sorting]

– Merge sort and Quick sort [divide and conquer sorting]

Page 5: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 5

Bubble sort

• A pretty dreadful type of sort!

• However, the code is small:5 3 2 4

3 5 2 4

3 2 5 4

3 2 4 5

end of one inner loop

for (int i=arr.length; i>0; i--) { for (int j=1; j<i; j++) { if (arr[j-1] > arr[j]) { temp = arr[j-1]; arr[j-1] = arr[j]; arr[j] = temp; } }}

5 ‘bubbled’ to the correct position2 3 4 5remaining elements put in place

Page 6: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 6

Insertion sort

Tree Insertion Sort• This is inserting into a normal tree structure: i.e. data are put into the

correct position when they are inserted.

• We’ve already seen this happening with a tree, requiring a find and an insert.

• We know the time complexity for one insert is O(logN) + O(1) = O(logN); therefore to insert N items will have a complexity of O(NlogN).

Array Insertion Sort• The array must be sorted; insertion requires a find + insert.

• The insertion time complexity is O(logN) + O(N) = O(N) for one item, therefore to insert N items will have a complexity of O(N2).

Page 7: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 7

SelectionSort

• SelectionSort uses an array implementation of the Priority Queue ADT (not the heap implementation we will study)

• The elements are inserted into the array as they arrive and then extracted in descending order into another array

• The major disadvantage is the performance overhead of finding the largest element at each step; we have to traverse over the entire array to find it

13 2 15 4

13 4 215

Page 8: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 8

SelectionSort (2)

• In practice, we use the same array for the Priority Queue and the output results

• General algorithm:1. Initialise PQLast to the last index of the Priority Queue

2. Search from the start of the array to PQLast for the largest element: call its position front

3. Swap element indexed by front with element indexed by PQLast

4. Decrement PQLast by one

5. Repeat steps 2 – 4

Page 9: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 9

SelectionSort (3)

12 913 2 15 4

PQLastfront

12 1513 2 9 4

PQLastfront

13 1512 2 9 4

PQLastfront

etc.

Page 10: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 10

Time complexity of SelectionSort

• The main operations being performed in the algorithm are:1. Comparisons to find the largest element in the Priority Queue subarray

(to find front index): there are ‘n + (n-1) + … + 2 + 1’ comparisons, i.e., n/2 (n+1) = (n2 + n) / 2 so this is an O(n2) operation

2. Swapping elements between front and PQLast: n-1 exchanges are performed so this is an O(n) operation

• The dominant operation (time-wise) gives the overall time complexity, i.e., O(n2)

• Although this is an O(n2) algorithm, its advantage overO(n log n) sorts is its simplicity

Page 11: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 11

Time complexity of SelectionSort (2)

• For very small sets of data, SelectionSort may actually be more efficient than O(n log n) algorithms

• This is because many of the more complex sorts have a relatively high level of overhead associated with them, e.g., recursion is expensive compared with simple loop iteration

• This overhead might outweigh the gains provided by a more complex algorithm where a small number of data elements is being sorted

• SelectionSort does better than BubbleSort as fewer swaps are required, although the same number of comparison operations are performed (each swap puts an element in its correct place)

Page 12: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 12

Shell sort [diminishing increment sorting]

• Named after D.L. Shell! But it is also rather like shrinking shells of sorting, for Insertion Sort.

• Shell sort aims to reduce the work done by insertion sort (i.e. scanning a list and inserting into the right position).

• Do the following:– Begin by looking at the lists of elements x1 elements apart and sort those

elements by insertion sort

– Reduce the number x1 to x2 so that x1 is not a multiple of x2 and repeat these two steps until x2 = 1.

• It can be shown that this approach will sort a list with a time complexity of O(N1.25).

Page 13: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 13

Shell Sort Illustration

• Consider sorting the following list by Shell sort:

452349679181224

9

24

45

91

6

8

23

4

7

12

9 6 424 8 745231291

4

8

9

12

45

6

7

23

24

91

4 6 8 7 92312244591

4 6 7 8 91223244591

GAP=3 GAP=2 GAP=1

Page 14: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 14

Comparing O(N2), O(N1.25) and O(N)O(N) O(N 1̂.25) O(N 2̂)

1 1 12 2.378414 43 3.948222 94 5.656854 165 7.476744 256 9.390507 367 11.38604 498 13.45434 649 15.58846 81

10 17.78279 10011 20.03276 12112 22.33452 14413 24.68478 16914 27.08071 19615 29.51985 22516 32 25617 34.51923 28918 37.07581 32419 39.66815 36120 42.29485 40021 44.9546 44122 47.64621 48423 50.36859 52924 53.12073 57625 55.9017 625

0

100

200

300

400

500

600

700

1 3 5 7 9 11 13 15 17 19 21 23 25

Size of input (N)

Tim

e (T

)

O(N)

O(N^1.25)

O(N^2)

Page 15: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 15

How do you choose the gap size?

• The idea of the decreasing gap size is that the list becomes more and more sorted each time the gap size is reduced, therefore you don’t (for example) want to have a gap size of 4 followed by a gap size of 2 because you’ll be sorting half the numbers a second time.

• There is no formal proof of a good initial gap size, but about a 10th the size of N is a reasonable start.

• Try to use prime numbers as your gap size, or odd numbers if you can not readily get a list of primes (though note gaps of 9, 7, 5, 3, 1 will be doing less work when gap=3).

Page 16: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 16

Divide and conquer sorting

MergeSort QuickSort

Page 17: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 17

Divide ...

5 1 4 2 10 3 9 15

5 1 4 2 10 3 9 15

5 1 4 2 10 3 9 15

5 1 4 2 10 3 9 15

Page 18: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 18

and conquer

1 2 3 4 5 9 10 15

1 2 4 5 3 9 10 15

5 1 4 2 10 3 9 15

1 5 2 4 3 10 9 15

Page 19: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 19

MergeSort [divide and conquer sorting]

• For MergeSort an initial array is repeatedly divided into halves (usually each is a separate array), until arrays of just one or zero elements remain

• At each level of recombination, two sorted arrays are merged into one

• This is done by copying the smaller of the two elements from the sorted arrays into the new array, and then moving along the arrays

1 13 24 26 2 15 27 38

Aptr Bptr Cptr

Page 20: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 20

Merging

1 13 24 26 2 15 27 38 1

Aptr Bptr Cptr

1 13 24 26 2 15 27 38 1 2

Aptr Bptr Cptr

1 13 24 26 2 15 27 38 1 2 13

Aptr Bptr Cptr

etc.

Page 21: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 21

Merge Sort In Pseudocode

• Basic idea in pseudocode:

method mergeSort(array) is: if len(array) == 0 or len(array) == 1 then: // terminating case

return array else:

end = len(array)center = end / 2left = mergeSort( array[0:center] ) // recursive call

right = mergeSort( array[center:end] ) // recursive call

return merge(left,right) // method that merges lists

end if end method

Page 22: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 22

Analysis of MergeSort

• Let the time to carry out a MergeSort on n elements be T(n)

• Assume that n is a power of 2, so that we always split into equal halves (the analysis can be refined for the general case)

• For n=1, the time is constant, so we can take T(1) = 1

• Otherwise, the time T(n) is the time to do two MergeSorts on n/2 elements, plus the time to merge, which is linear

• So, T(n) = 2 T(n/2) + n

• Divide through by n to get T(n)/n = T(n/2)/(n/2) + 1

• Replacing n by n/2 gives, T(n/2)/(n/2) = T(n/4)/(n/4) + 1

• And again gives, T(n/4)/(n/4) = T(n/8)/(n/8) + 1

Page 23: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 23

Analysis of MergeSort (2)

• We continue until we end up with T(2)/2 = T(1)/1 + 1

• Since n is divided by 2 at each step, we have log2n steps

• Now, substituting the last equation in the previous one, and working back up to the top gives T(n)/n = T(1)/1 + log2n

• That is, T(n)/n = log2n + 1

• So T(n) = n log2n + n = O(n log n)

• Although this is an O(n log n) algorithm, it is hardly ever used for main memory sorts because it requires linear extra memory

Page 24: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 24

QuickSort [divide and conquer sorting]

• As its name implies, QuickSort is the fastest known sorting algorithm in practice (address-based sorts can be faster)

• It was devised by C.A.R. Hoare in 1962• Its average running time is O(n log n) and it is very fast• It has worst-case performance of O(n2) but this can be made very

unlikely with little effort• The idea is as follows:

1. If the number of elements to be sorted is 0 or 1, then return2. Pick any element, v (this is called the pivot)

3. Partition the other elements into two disjoint sets, S1 of elements v, and S2 of elements > v

4. Return QuickSort (S1) followed by v followed by QuickSort (S2)

Page 25: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 25

QuickSort example

5 1 4 2 10 3 9 15 12

Pick the middle element as the pivot, i.e., 10

5 1 4 2 3 9 15 12

Partition into the two subsets below

Sort the subsets

1 2 3 4 5 9 12 15

Recombine with the pivot

1 2 3 4 5 9 10 12 15

Page 26: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 26

QuickSort Example (full)

5 1 4 2 10 3 9 15 12

5 1 4 2 3 9 15 1210

Unsorted List

Partition1

1 2 3 4 5 9 12 15

1 32

1 2 3 4 5 9 10 12 15

5 9

Partition2

Partition3

Sorted List

Page 27: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 27

QuickSort Pseudocode

• Basic idea in pseudocode (compare with mergeSort)

method quickSort(array) is: if len(array) == 0 or len(array) == 1 then: // terminating case

return array else:

pivotIndx = int( random() * len(array) )pivot = array(pivotIndx) // just choose a random pivot

(left, right) = partition(array, pivot) // get left and right

return quickSort(left) + pivot + quickSort(right) end if end method

Page 28: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 28

Partitioning example

5 11 4 25 10 3 9 15 12

Pick the middle element as the pivot, i.e., 10

Move the pivot out of the way by swapping it with the first element

10 11 4 25 5 3 9 15 12

swapPos

Step along the array, swapping small elements into swapPos

10 4 11 25 5 3 9 15 12

swapPos

Page 29: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 29

Partitioning example (2)

10 4 5 25 11 3 9 15 12

swapPos

10 4 5 3 11 25 9 15 12

swapPos

10 4 5 3 9 25 11 15 12

swapPos

Page 30: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 30

Pseudo code for partitioning

pivotPos = middle of array a;swap a[pivotPos] with a[first]; // Move the pivot out of the wayswapPos = first + 1;for each element in the array from swapPos to last do: // If the current element is smaller than pivot we // move it towards start of array if (a[currentElement] < a[first]): swap a[swapPos] with a[currentElement]; increment swapPos by 1;// Now move the pivot back to its rightful placeswap a[first] with a[swapPos-1];return swapPos-1; // Pivot position

Page 31: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 31

Analysis of QuickSort

• We assume a random choice of pivot

• Let the time to carry out a QuickSort on n elements be T(n)

• We have T(0) = T(1) = 1

• The running time of QuickSort is the running time of the partitioning (linear in n) plus the running time of the two recursive calls of QuickSort

• Let i be the number of elements in the left partition, then T(n) = T(i) + T(n–i–1) + cn (for some constant c)

Page 32: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 32

Worst-case analysis

• If the pivot is always the smallest element, then i = 0 always

• We ignore the term T(0) = 1, so the recurrence relation isT(n) = T(n–1) + cn

• So, T(n–1) = T(n–2) + c(n–1) and so on until we getT(2) = T(1) + c(2)

• Substituting back up gives T(n) = T(1) + c(n + … + 2) = O(n2)

• Notice that this case happens if we always take the pivot to be the first element in the array and the array is already sorted

• So, in this extreme case, QuickSort takes O(n2) time to do absolutely nothing!

Page 33: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 33

Best-case analysis

• In the best case, the pivot is in the middle

• To simplify the equations, we assume that the two subarrays are each exactly half the length of the original (a slight overestimate which is acceptable for big-Oh calculations)

• So, we get T(n) = 2T(n/2) + cn

• This is very similar to the formula for MergeSort, and a similar analysis leads to T(n) = cn log2n + n = O(n log n)

Page 34: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 34

Average-case analysis

• We assume that each of the sizes of the left partition are equally likely, and hence have probability 1/n

• With this assumption, the average value of T(i), and hence also of T(n–i–1), is (T(0) + T(1) + … + T(n–1))/n

• Hence, our recurrence relation becomesT(n) = 2(T(0) + T(1) + … + T(n–1))/n + cn

• Multiplying by n givesnT(n) = 2(T(0) + T(1) + … + T(n–1)) + cn2

• Replacing n by n–1 gives(n–1)T(n–1) = 2(T(0) + T(1) + … + T(n–2)) + c(n–1)2

• Subtracting the last equation from the previous one givesnT(n) – (n–1)T(n–1) = 2T(n–1) + 2cn – c

Page 35: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 35

Average-case analysis (2)

• Rearranging, and dropping the insignificant c on the end, gives nT(n) = (n+1)T(n–1) + 2cn

• Divide through by n(n+1) to getT(n)/(n+1) = T(n–1)/n + 2c/(n+1)

• Hence, T(n–1)/n = T(n–2)/(n–1) + 2c/n and so on down toT(2)/3 = T(1)/2 + 2c/3

• Substituting back up givesT(n)/(n+1) = T(1)/2 + 2c(1/3 + 1/4 + … + 1/(n+1))

• The sum in brackets is about loge(n+1) + – 3/2, where is Euler’s constant, which is approximately 0.577

• So, T(n)/(n+1) = O(log n) and T(n) = O(n log n)

Page 36: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 36

Some observations about QuickSort

• We have seen that a consistently poor choice of pivot can lead to O(n2) time performance

• A good strategy is to pick the middle value of the left, centre, and right elements

• For small arrays, with n less than (say) 20, QuickSort does not perform as well as simpler sorts such as SelectionSort

• Because QuickSort is recursive, these small cases will occur frequently

• A common solution is to stop the recursion at n = 10, say, and use a different, non-recursive sort

• This also avoids nasty special cases, e.g., trying to take the middle of three elements when n is one or two

Page 37: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 37

Address-Based Sorting

• Proxmap uses techniques similar to hashing to assign an element to its correctly sorted position in a container such as an array using a hash function

• The algorithms are generally complex, and very often only suitable for certain kinds of data

• You need to find a suitable hashing function that will distribute data fairly evenly

• Clashes are dealt with using a comparison based sort, so the more clashes there are the further the time complexity moves away from O(n) (see notes view for more details)

Page 38: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 38

Radix / Bucket / Bin sort• Radix Sort

– In its favour, it is an O(n) sort. That makes it the fastest sort we have investigated.

– However it requires at least 2n space in which to operate, and is in principle exponential in its space complexity..

– A radix sort makes one pass through the data for each atom in the key. If the key is three character uppercase alphabetic strings, such as ABC, then A is an atom. In this case, there would be three passes through the data.

– The first pass sorts on the low order element of the key (the C in our example). The second pass sorts on the next atom, in order of importance (the B in our example). Each pass progresses toward the high order atom of the key.

– In each such pass, the elements of the array are distributed into k buckets. For our alphabetic key example, A's go into the 'A' bucket, B's into the 'B' bucket, and so on. These distribution buckets are then gathered, in order, and placed back in the original array. The next pass is then executed. When a pass has been made on the high order atom in the key, the array will be sorted.

Page 39: Sorting. CSM1220 - Sorting 2 Introduction The objective is to take an unordered set of comparable data items and arrange them in order. We will usually

CSM1220 - Sorting 39

Radix Sort Example

123234345134245356156267378121232343

345356378343

156

123134156121

234245267232

Bins 0, 4-9 are empty

123121

134

Bins 0-1,4,6-9Are empty

234232

245

267

Bins 0-2,5,7-9Are empty

121

123

Bins 0,2,4-9Are empty

156

134

232

234

Bins 0-1,3,5-9Are empty

.

.

.

col1 col2 col3