6.3.1 Heapsort


Heapsort

Idea: two phases: 1. Build the heap. 2. Output the heap.

To sort numbers in ascending order: larger value => higher priority (the largest element is at the root).

Heapsort is an in-place (in-situ) procedure.


Recall heaps: a change in the definition

Heap with reversed order:

• For every node x and every successor y of x, $m(x) \ge m(y)$ holds,

• left-complete: the levels are filled starting from the root, and each level is filled from left to right,

• Array implementation, where the nodes are stored level by level (from left to right).


Define a heap segment as an array segment a[i..k] ($1 \le i \le k \le n$) such that for every j in {i,...,k}: $m(a[j]) \ge m(a[2j])$ if $2j \le k$, and $m(a[j]) \ge m(a[2j+1])$ if $2j+1 \le k$.

If a[i+1..n] is a heap segment, we can easily turn a[i..n] into a heap segment as well by "sinking" a[i].


First phase:

1. Building the heap. Simple method: insert n times. Cost: O(n log n). Doing better: regard the array a[1 … n] as a heap that is already correctly ordered on the right. The elements of the left half, which are not yet in order, are "sunk" in the following sequence: a[n div 2] … a[2] a[1] (the elements a[n] … a[n div 2 + 1] are already at the leaves).

[Figure: the leaves of the heap]


Second phase

2. Output of the heap: n times, take the maximum (at the root), swap it with the last element of the heap, and sink the new root. → The heap shrinks by one element and the largest element ends up at the end.

Repeat this process until only one element is left in the heap (the smallest).

Cost: O(n log n).
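The two phases can be sketched in Java as follows. This is a minimal sketch rather than the lecture's own code: it assumes an int array with 0-based indices (so the children of position j are 2j+1 and 2j+2 instead of 2j and 2j+1), and the names HeapsortSketch, sink and heapsort are chosen here for illustration.

```java
import java.util.Arrays;

class HeapsortSketch {
    // Sink a[i] within a[i..last], assuming a[i+1..last] is a heap segment.
    // Indices are 0-based, so the children of j are 2j+1 and 2j+2.
    static void sink(int[] a, int i, int last) {
        int j = i;
        while (2 * j + 1 <= last) {
            int child = 2 * j + 1;                        // left child
            if (child + 1 <= last && a[child + 1] > a[child]) {
                child++;                                  // right child is larger
            }
            if (a[j] >= a[child]) {
                break;                                    // heap condition holds
            }
            int tmp = a[j]; a[j] = a[child]; a[child] = tmp;
            j = child;
        }
    }

    static void heapsort(int[] a) {
        int n = a.length;
        // Phase 1: build the heap; the leaves already form heap segments.
        for (int i = n / 2 - 1; i >= 0; i--) {
            sink(a, i, n - 1);
        }
        // Phase 2: swap the maximum to the end and shrink the heap by one.
        for (int last = n - 1; last > 0; last--) {
            int tmp = a[0]; a[0] = a[last]; a[last] = tmp;
            sink(a, 0, last - 1);
        }
    }

    public static void main(String[] args) {
        int[] a = {64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50};
        heapsort(a);
        System.out.println(Arrays.toString(a));
    }
}
```

No second array is allocated: the sorted elements accumulate at the back of the same array while the heap shrinks at the front.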

[Figure: the array split into the heap (front part) and the ordered elements (back part)]


Cost calculation

Let $k = \lceil \log_2(n+1) \rceil$ be the height of the heap that is being built in phase 1.

For an element at level j, assuming that levels j+1 through k are already built, the maximum cost of including it in the segment is $k - j$.

In addition, each level j holds at most $2^j$ elements. In total: $\sum_{j=0}^{k} (k-j) \cdot 2^j = 2^k \cdot \sum_{i=0}^{k} i/2^i \le 2 \cdot 2^k = O(n)$.
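The steps behind this sum can be spelled out as follows. This is a short derivation sketch: the substitution i = k − j and the bound on the infinite series are standard facts and are not taken from the slides.

```latex
\begin{align*}
\sum_{j=0}^{k} (k-j)\,2^{j}
  &= \sum_{i=0}^{k} i\,2^{k-i}
     && \text{(substitute } i = k-j\text{)} \\
  &= 2^{k} \sum_{i=0}^{k} \frac{i}{2^{i}}
   \;\le\; 2^{k} \sum_{i=0}^{\infty} \frac{i}{2^{i}}
   \;=\; 2 \cdot 2^{k}
     && \text{(since } \textstyle\sum_{i \ge 0} i\,x^{i} = \frac{x}{(1-x)^{2}} \text{ for } |x| < 1\text{)}
\end{align*}
```

Because k is logarithmic in n, $2^k$ is at most a constant multiple of n, which gives the O(n) bound for building the heap.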


Advantage:

The new construction strategy is more efficient: O(n) instead of O(n log n)!

Usage: when only the m largest elements are required:

1. Construction in O(n) steps. 2. Output of the m largest elements in O(m•log n) steps. Total cost: O(n + m•log n).
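A minimal Java sketch of this usage follows. It assumes int keys and 0-based indices; the class name LargestMSketch and the method largest are illustrative names, not part of the lecture.

```java
class LargestMSketch {
    // Sink a[i] within a[i..last] (0-based; children of j are 2j+1 and 2j+2).
    static void sink(int[] a, int i, int last) {
        int j = i;
        while (2 * j + 1 <= last) {
            int child = 2 * j + 1;
            if (child + 1 <= last && a[child + 1] > a[child]) child++;
            if (a[j] >= a[child]) break;
            int tmp = a[j]; a[j] = a[child]; a[child] = tmp;
            j = child;
        }
    }

    // Returns the m largest elements of input in descending order.
    static int[] largest(int[] input, int m) {
        int[] a = input.clone();                  // leave the caller's array untouched
        int n = a.length;
        for (int i = n / 2 - 1; i >= 0; i--) {    // step 1: O(n) construction
            sink(a, i, n - 1);
        }
        int[] result = new int[Math.min(m, n)];
        int last = n - 1;
        for (int t = 0; t < result.length; t++) { // step 2: O(m log n) output
            result[t] = a[0];                     // current maximum
            a[0] = a[last];                       // last heap element becomes the root
            last--;
            sink(a, 0, last);
        }
        return result;
    }
}
```

For m much smaller than n, the O(n) construction dominates, which is where the faster construction pays off.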


Addendum: Sorting with search trees

Algorithm:
1. Construction of a search tree (e.g. an AVL tree) from the elements to be sorted, using n insert operations.
2. Output of the elements in in-order sequence.

→ Ordered sequence.

Cost: 1. O(n log n) with AVL trees, 2. O(n). In total: O(n log n). Optimal!
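The idea can be tried out with a library search tree. The sketch below uses java.util.TreeMap, which is a red-black tree rather than an AVL tree; that substitution is an assumption made here for brevity, and it keeps the O(log n) cost per insertion. The class name TreeSortSketch is illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

class TreeSortSketch {
    static int[] treeSort(int[] input) {
        // Step 1: n insertions into a balanced search tree.
        // A counter per key preserves duplicate values.
        TreeMap<Integer, Integer> tree = new TreeMap<>();
        for (int x : input) {
            tree.merge(x, 1, Integer::sum);
        }
        // Step 2: in-order output (TreeMap iterates in ascending key order).
        int[] out = new int[input.length];
        int pos = 0;
        for (Map.Entry<Integer, Integer> e : tree.entrySet()) {
            for (int c = 0; c < e.getValue(); c++) {
                out[pos++] = e.getKey();
            }
        }
        return out;
    }
}
```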


7.2 External Sorting

Problem: sorting large amounts of data which, as in external searching, are stored in blocks (pages).

Efficiency: the number of page accesses should be kept low!

Strategy: a sorting algorithm that processes the data sequentially (no frequent page exchanges): MergeSort!

General form of MergeSort

    mergesort(S)   # returns the set S in sorted order
    {
        if (S is empty or has only 1 element)
            return(S);
        else {
            Split S into two halves A and B;
            A' = mergesort(A);
            B' = mergesort(B);
            return(merge(A', B'));
        }
    }


MergeSort on an array: an $O(n \log_2 n)$ algorithm

    void mergesort(Comparable[] x, int ip, int iu) {
        if (ip >= iu) return;        // base case
        int im = (ip + iu) / 2;      // middle index
        mergesort(x, ip, im);        // sort 1st half
        mergesort(x, im + 1, iu);    // sort 2nd half
        merge(x, ip, im, iu);        // merge the halves
    }

    void merge(Comparable[] x, int ip, int im, int iu) {
        Comparable[] a = new Comparable[iu + 1];   // auxiliary array
        for (int i = ip, i1 = ip, i2 = im + 1; i <= iu; ++i)
            if (i1 <= im && (i2 > iu || x[i1].compareTo(x[i2]) < 0))
                a[i] = x[i1++];
            else
                a[i] = x[i2++];
        for (int i = ip; i <= iu; ++i)
            x[i] = a[i];
    }

Informal analysis
• The recursion in mergesort is $\log_2 n$ levels deep (as many times as the array can be split in half).
• Each call to mergesort invokes merge once.
• Merge performs O(n) comparisons (on each level of the recursion).
• In total: $O(n \log_2 n)$ comparisons.

Space?
• Mergesort requires additional space for the n elements.
• The additional space can be reduced at the cost of running time (left as an exercise).


Mergesort on files

Start: there are n records in a file g1, divided into pages of size b:

Page 1: $s_1, \dots, s_b$
Page 2: $s_{b+1}, \dots, s_{2b}$
…
Page k: $s_{(k-1)b+1}, \dots, s_n$

($k = \lceil n/b \rceil$) If the records are processed sequentially, only k page accesses are made, not n.


Variation of MergeSort for external sorting

MergeSort: a divide-and-conquer algorithm

For external sorting: no divide step, only merge.

Definition: run := sorted subsequence within a file.

Strategy: merge the generated runs into increasingly longer runs until everything is sorted.


Algorithm

Step 1: From the input file g1, generate "starting runs" and distribute them over two files f1 and f2, with the same number of runs (±1) in each (there are several strategies for this, discussed later).

From now on: use 4 files f1, f2, g1, g2.


Step 2 (main step):

while (number of runs > 1) {
    • Merge pairs of runs, one from f1 and one from f2, into runs of double length, writing them alternately to g1 and g2, until no runs remain in f1 and f2.
    • Merge pairs of runs, one from g1 and one from g2, into runs of double length, writing them alternately to f1 and f2, until no runs remain in g1 and g2.
}

Each loop iteration = two phases
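The main step can be simulated in memory to see how the runs grow. In the sketch below each "file" is modelled as a list of runs, starting runs have length 1, and each pass of the while loop corresponds to one phase (half of the loop above). The class BalancedMergeSketch and its methods are illustrative assumptions, real files and page buffers are left out, and List.of requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;

class BalancedMergeSketch {
    // Merge two sorted runs into one sorted run.
    static List<Integer> mergeRuns(List<Integer> x, List<Integer> y) {
        List<Integer> out = new ArrayList<>(x.size() + y.size());
        int i = 0, j = 0;
        while (i < x.size() || j < y.size()) {
            if (j >= y.size() || (i < x.size() && x.get(i) <= y.get(j))) {
                out.add(x.get(i++));
            } else {
                out.add(y.get(j++));
            }
        }
        return out;
    }

    static List<Integer> sort(List<Integer> g1) {
        // Step 1: starting runs of length 1, distributed alternately.
        List<List<Integer>> in1 = new ArrayList<>(), in2 = new ArrayList<>();
        for (int t = 0; t < g1.size(); t++) {
            (t % 2 == 0 ? in1 : in2).add(List.of(g1.get(t)));
        }
        // Step 2: one phase per pass, until a single run is left.
        while (in1.size() + in2.size() > 1) {
            List<List<Integer>> out1 = new ArrayList<>(), out2 = new ArrayList<>();
            int pairs = Math.max(in1.size(), in2.size());
            for (int p = 0; p < pairs; p++) {
                List<Integer> x = p < in1.size() ? in1.get(p) : List.of();
                List<Integer> y = p < in2.size() ? in2.get(p) : List.of();
                (p % 2 == 0 ? out1 : out2).add(mergeRuns(x, y));
            }
            in1 = out1;   // the output files become the next input files
            in2 = out2;
        }
        return in1.isEmpty() ? new ArrayList<>() : in1.get(0);
    }
}
```

Swapping the roles of input and output lists after each phase mirrors the alternation between f1, f2 and g1, g2.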


Example:

Start:
    g1: 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50

Step 1 (length of starting runs = 1):
    f1: 64 | 3 | 79 | 19 | 67 | 8 | 50
    f2: 17 | 99 | 78 | 13 | 34 | 12

Main step, 1st loop, part 1 (1st phase):
    g1: 17, 64 | 78, 79 | 34, 67 | 50
    g2: 3, 99 | 13, 19 | 8, 12

1st loop, part 2 (2nd phase):
    f1: 3, 17, 64, 99 | 8, 12, 34, 67 |
    f2: 13, 19, 78, 79 | 50 |


Example continuation

1st loop, part 2 (2nd phase):
    f1: 3, 17, 64, 99 | 8, 12, 34, 67 |
    f2: 13, 19, 78, 79 | 50 |

2nd loop, part 1 (3rd phase):
    g1: 3, 13, 17, 19, 64, 78, 79, 99 |
    g2: 8, 12, 34, 50, 67 |

2nd loop, part 2 (4th phase):
    f1: 3, 8, 12, 13, 17, 19, 34, 50, 64, 67, 78, 79, 99 |
    f2:


Implementation:

For each of the files f1, f2, g1, g2, at least one page is kept in main memory (RAM); better still, a second page per file is kept as a buffer.

Read/write operations are made page-wise.

Problem. Sort a file (assuming that only N lines fit in memory).

Solution. "Cascade" algorithm.

[Figure: the file is read in chunks of N lines. Each chunk is sorted in an array; the first chunk is written to file A1, the next chunk is merged with A1 into file A2, the next with A2 into file A1, and so on. The result ends up in file A1 or A2.]

Sorting a text file

    import java.io.*;
    import java.util.Arrays;

    // BR, PW and U.println in the original are written out here as BufferedReader,
    // PrintWriter and System.out.println (assumed to be what the abbreviations stand for).
    class Sortfile {
        // array for N lines
        protected static int N = 100, n = 0;
        protected static String[] linea = new String[N];

        // Read at most z lines from (the cursor position of) file x
        // and store them in array y. Also return the number of lines read.
        static public int leerLineas(BufferedReader x, String[] y, int z) throws IOException {
            int i;
            for (i = 0; i < z && (y[i] = x.readLine()) != null; ++i);
            return i;
        }

        public static void main(String[] args) throws IOException {
            // write an empty auxiliary file
            PrintWriter B = new PrintWriter(new FileWriter("A2.txt"));
            B.close();
            // open the input file and name the auxiliary files
            BufferedReader A = new BufferedReader(new FileReader(args[0]));
            String in = "A2.txt", out = "A1.txt";
            // repeat until the input file is exhausted
            while (true) {
                // read n lines (at most N) from the file into the array
                n = leerLineas(A, linea, N);
                if (n == 0) break;
                // sort the array of n lines (the original calls a quicksort routine
                // that is not shown; Arrays.sort stands in for it here)
                Arrays.sort(linea, 0, n);
                // merge array and file in (result in file out)
                merge(linea, n, in, out);
                // swap the roles of the files
                String aux = in; in = out; out = aux;
            }
            A.close();
            System.out.println("result: file " + in);
        }

        // merge a sorted array with a sorted file
        static public void merge(String[] x, int n, String y, String z) throws IOException {
            PrintWriter O = new PrintWriter(new FileWriter(z));
            BufferedReader I = new BufferedReader(new FileReader(y));
            // get the first elements
            int i = 0;
            String s = I.readLine();
            // repeat until both the array and the file are exhausted
            while (i < n || s != null)
                // write the smaller one (or whatever remains) and advance
                if (i < n && (s == null || x[i].compareTo(s) <= 0)) {
                    O.println(x[i]);
                    ++i;                // advance the index
                } else {
                    O.println(s);
                    s = I.readLine();   // next line
                }
            I.close();
            O.close();
        }
    }


Costs

Page accesses during step 1 and during each phase: O(n/b)

Each phase halves the number of runs, thus:

Total number of page accesses: O((n/b) log n), when starting with runs of length 1.

Internal computing time in step 1 and in each phase: O(n).

Total internal computing time: O( n log n ).


Two variants of the first step: creating the starting runs

• A) Direct merging: sort as many records as possible in primary memory ("internally"), for example m records. → Starting runs of (fixed!) length m, thus r := n/m starting runs. The total number of page accesses is then O((n/b) log(r)).


Two variants of the first step: creating the starting runs

• B) Natural merging: creates starting runs of variable length.

Advantage: ordered subsequences that the file may already contain can be exploited.

Noteworthy: with the replacement-selection method the starting runs become longer, as if a bigger primary storage were available!


Replacement-Selection

Read m records from the input file into primary memory (array).

repeat {
    mark all records in the array as "now".
    start a new run.
    while the array contains a record marked "now" {
        • select the smallest (smallest key) among all records marked "now",
        • write it to the output file,
        • replace it in the array with a record read from the input file (if any remain); mark the new record "now" if it is greater than or equal to the last record written, otherwise mark it "not now".
    }
} until the input file is exhausted and the array is empty.
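The procedure can be sketched in Java with two containers instead of explicit marks: a min-heap for the "now" records and a list for the "not now" records. This is a sketch under assumptions: integer keys, an Iterator standing in for the input file, m ≥ 1, and runs returned as lists instead of being written to a file. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class ReplacementSelectionSketch {
    static List<List<Integer>> startingRuns(Iterator<Integer> input, int m) {
        PriorityQueue<Integer> now = new PriorityQueue<>();   // records marked "now" (min-heap)
        List<Integer> notNow = new ArrayList<>();             // records marked "not now"
        // Read up to m records into primary memory.
        while (now.size() < m && input.hasNext()) {
            now.add(input.next());
        }
        List<List<Integer>> runs = new ArrayList<>();
        while (!now.isEmpty()) {
            List<Integer> run = new ArrayList<>();            // start a new run
            while (!now.isEmpty()) {
                int smallest = now.poll();                    // smallest "now" record
                run.add(smallest);                            // "write it to the output file"
                if (input.hasNext()) {
                    int next = input.next();                  // refill the freed slot
                    if (next >= smallest) {
                        now.add(next);                        // still fits into the current run
                    } else {
                        notNow.add(next);                     // has to wait for the next run
                    }
                }
            }
            runs.add(run);
            now.addAll(notNow);                               // mark everything "now" again
            notNow.clear();
        }
        return runs;
    }
}
```

The outer loop only stops once both the input and the in-memory records are used up, so the records still held in memory after the last read form the final run.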


Example: array in primary storage with a capacity of 3. The input file contains the following data:

64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50

Runs: 3, 17, 64, 78, 79, 99 | 13, 19, 34, 67 | 8, 12, 50

Contents of the array, step by step ("not now" records written in parentheses):

Run 1:
    64    17    3
    64    17    99
    64    79    99
    78    79    99
    (19)  79    99
    (19)  (13)  99
    (19)  (13)  (67)

Run 2:
    19    13    67
    19    34    67
    (8)   34    67
    (8)   (12)  67
    (8)   (12)  (50)

Run 3:
    8     12    50
    12    50
    50


Implementation:

In an array:
• At the front: a heap for the records marked "now",
• At the back: the refilled "not now" records.

Note: all "now" elements go to the run currently being generated.


Expected length of the starting runs using the replacement-selection method:

• 2•m (m = size of the array in primary storage = number of records that fit into primary storage), assuming uniformly distributed data

• Even longer if the input is already partially sorted!


Multi-way merging

Instead of using two input and two output files (alternating between f1, f2 and g1, g2), use k input and k output files, so that k runs can always be merged into one.

In each step: take the smallest number among the k runs and output it to the current output file.
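Selecting the smallest among the k current elements is commonly done with a min-heap, which gives the log(k) factor per output element. A minimal in-memory sketch, assuming integer keys and runs given as lists (the class KWayMergeSketch and the method merge are illustrative names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class KWayMergeSketch {
    // Merge k sorted runs into one sorted run.
    static List<Integer> merge(List<List<Integer>> runs) {
        // Heap entries: {value, index of the run, position within the run},
        // ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((p, q) -> Integer.compare(p[0], q[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();              // smallest among the k runs
            out.add(top[0]);                      // "output it to the current output file"
            int r = top[1], next = top[2] + 1;
            if (next < runs.get(r).size()) {      // advance within the same run
                heap.add(new int[] {runs.get(r).get(next), r, next});
            }
        }
        return out;
    }
}
```

The heap never holds more than k entries, one per run that still has elements left.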


Cost: in each phase the number of runs is divided by k. Thus, with r starting runs, only $\log_k(r)$ phases are needed (instead of $\log_2(r)$).

Total number of page accesses: $O((n/b) \log_k(r))$.

Internal computing time for each phase: $O(n \log_2(k))$.

Total internal computing time: $O(n \log_2(k) \log_k(r)) = O(n \log_2(r))$.
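The last equality is the change-of-base rule for logarithms. A short worked form follows; the numbers r = 64 and k = 4 are chosen here purely for illustration.

```latex
\begin{align*}
\log_k r &= \frac{\log_2 r}{\log_2 k}
  \quad\Longrightarrow\quad
  \log_2 k \cdot \log_k r = \log_2 r, \\
\text{e.g. } r = 64,\ k = 4:\quad
  \log_4 64 &= 3 \text{ phases instead of } \log_2 64 = 6,
  \qquad \log_2 4 \cdot \log_4 64 = 2 \cdot 3 = 6 = \log_2 64 .
\end{align*}
```

So multi-way merging reduces the number of page accesses, while the total internal computing time stays the same.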
