cache-oblivious algorithms and data structuresgerth/slides/swat04invited.pdfcache-oblivious...

71
Cache-Oblivious Algorithms and Data Structures Gerth Stølting Brodal University of Aarhus 1

Upload: others

Post on 12-Feb-2020

19 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Cache-ObliviousAlgorithms and Data Structures

Gerth Stølting BrodalUniversity of Aarhus

"! #

1

Page 2: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Outline

• Motivation– A typical workstation– A trivial program

• Memory models– I/O model– Ideal cache model

• Basic cache-oblivious algorithms– Matrix multiplication– Search trees– Sorting

• Some experimental results

• Conclusion

Cache-Oblivious Algorithms and Data Structures 2

Page 3: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

A Typical Workstation

Cache-Oblivious Algorithms and Data Structures 3

Page 4: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Customizing a Dell 650

www.dell.dk

www.intel.com

Processor speed 2.4 – 3.2 GHzL3 cache size 0.5 – 2 MBMemory 1/4 – 4 GBHard Disk 36 GB – 146 GB

7.200 – 15.000 RPMCD/DVD 8 – 48x

L2 cache size 256 – 512 KBL2 cache line size 128 BytesL1 cache line size 64 BytesL1 cache size 16 KB

Do we want to know ?

Cache-Oblivious Algorithms and Data Structures 4

Page 5: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Customizing a Dell 650

www.dell.dk

www.intel.com

Processor speed 2.4 – 3.2 GHzL3 cache size 0.5 – 2 MBMemory 1/4 – 4 GBHard Disk 36 GB – 146 GB

7.200 – 15.000 RPMCD/DVD 8 – 48x

L2 cache size 256 – 512 KBL2 cache line size 128 BytesL1 cache line size 64 BytesL1 cache size 16 KB

Do we want to know ?

Cache-Oblivious Algorithms and Data Structures 4

Page 6: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Hierarchical Memory Basics

CPU L1 L2 A

R

M

Increasingaccess timeand space

L3

B1

B4

B3

B2

Disk

• Data moved between adjacent memory levels in blocks

Cache-Oblivious Algorithms and Data Structures 5

Page 7: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

A Trivial Program

for (i=0; i+d<n; i+=d) A[i]=i+d;A[i]=0;

for (i=0, j=0; j<8*1024*1024; j++) i=A[i];

d

A

n

Cache-Oblivious Algorithms and Data Structures 6

Page 8: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

A Trivial Program (cont.) d = 1

0

20

40

60

80

100

120

140

160

180

200

0 5 10 15 20 25

Sec

onds

log n

RAM : n ≈ 225 ≡ 128MB

Cache-Oblivious Algorithms and Data Structures 7

Page 9: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

A Trivial Program (cont.) d = 1

0

0.5

1

1.5

2

2.5

3

2 4 6 8 10 12 14 16 18 20

Sec

onds

log n

L1 : n ≈ 212 ≡ 16KB L2 : n ≈ 216 ≡ 256KB

Cache-Oblivious Algorithms and Data Structures 8

Page 10: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

A Trivial Program (cont.) n = 224

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

0 5 10 15 20 25

Sec

onds

log d

Cache line d = 23 ≡ 32 Bytes

Do we want to know ?

Cache-Oblivious Algorithms and Data Structures 9

Page 11: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

A Trivial Program (cont.) n = 224

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

0 5 10 15 20 25

Sec

onds

log d

Cache line d = 23 ≡ 32 Bytes

Do we want to know ?

Cache-Oblivious Algorithms and Data Structures 9

Page 12: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

A Trivial Program (cont.)— If you want to know...

Experiments were performed on a DELL 8000, Pentium III,850 MHz, 128 MB RAM, running Linux 2.4.2, and usinggcc version 2.96 with optimization -O3

L1 instruction and data caches

• 4-way set associative, 32-byte line size

• 16 KB instruction cache and 16 KB write-back data cache

L2 level cache

• 8-way set associative, 32-byte line size

• 256 KB

www.Intel.com

Cache-Oblivious Algorithms and Data Structures 10

Page 13: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Algorithmic Problem

• Memory hierarchy has become a fact of life

• Accessing non-local storage may take a very long time

• Good locality is important for achieving high performance

Latency Relativeto CPU

Register 0.5 ns 1

L1 cache 0.5 ns 1-2

L2 cache 3 ns 2-7

DRAM 150 ns 80-200

TLB 500+ ns 200-2000

Disk 10 ms 10

Increasing

Cache-Oblivious Algorithms and Data Structures 11

Page 14: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Algorithmic Problem• Modern hardware is not uniform — many different parameters

– Number of memory levels– Cache sizes– Cache line/disk block sizes– Cache associativity– Cache replacement strategy– CPU/BUS/memory speed

• Programs should ideally run for many different parameters– by knowing many of the parameters at runtime– by knowing few essential parameters– ignoring the memory hierarchies practice

• Programs are executed on unpredictable configurations– Generic portable and scalable software libraries– Code downloaded from the Internet, e.g. Java applets– Dynamic environments, e.g. multiple processes

Cache-Oblivious Algorithms and Data Structures 12

Page 15: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Algorithmic Problem• Modern hardware is not uniform — many different parameters

– Number of memory levels– Cache sizes– Cache line/disk block sizes– Cache associativity– Cache replacement strategy– CPU/BUS/memory speed

• Programs should ideally run for many different parameters

– by knowing many of the parameters at runtime– by knowing few essential parameters– ignoring the memory hierarchies practice

• Programs are executed on unpredictable configurations– Generic portable and scalable software libraries– Code downloaded from the Internet, e.g. Java applets– Dynamic environments, e.g. multiple processes

Cache-Oblivious Algorithms and Data Structures 12

Page 16: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Algorithmic Problem• Modern hardware is not uniform — many different parameters

– Number of memory levels– Cache sizes– Cache line/disk block sizes– Cache associativity– Cache replacement strategy– CPU/BUS/memory speed

• Programs should ideally run for many different parameters– by knowing many of the parameters at runtime– by knowing few essential parameters– ignoring the memory hierarchies

practice

• Programs are executed on unpredictable configurations– Generic portable and scalable software libraries– Code downloaded from the Internet, e.g. Java applets– Dynamic environments, e.g. multiple processes

Cache-Oblivious Algorithms and Data Structures 12

Page 17: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Algorithmic Problem• Modern hardware is not uniform — many different parameters

– Number of memory levels– Cache sizes– Cache line/disk block sizes– Cache associativity– Cache replacement strategy– CPU/BUS/memory speed

• Programs should ideally run for many different parameters– by knowing many of the parameters at runtime– by knowing few essential parameters– ignoring the memory hierarchies practice

• Programs are executed on unpredictable configurations– Generic portable and scalable software libraries– Code downloaded from the Internet, e.g. Java applets– Dynamic environments, e.g. multiple processes

Cache-Oblivious Algorithms and Data Structures 12

Page 18: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Algorithmic Problem• Modern hardware is not uniform — many different parameters

– Number of memory levels– Cache sizes– Cache line/disk block sizes– Cache associativity– Cache replacement strategy– CPU/BUS/memory speed

• Programs should ideally run for many different parameters– by knowing many of the parameters at runtime– by knowing few essential parameters– ignoring the memory hierarchies practice

• Programs are executed on unpredictable configurations– Generic portable and scalable software libraries– Code downloaded from the Internet, e.g. Java applets– Dynamic environments, e.g. multiple processes

Cache-Oblivious Algorithms and Data Structures 12

Page 19: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Outline

• Motivation– A typical workstation– A trivial program

• Memory models– I/O model– Ideal cache model

• Basic cache-oblivious algorithms– Matrix multiplication– Search trees– Sorting

• Some experimental results

• Conclusion

Cache-Oblivious Algorithms and Data Structures 13

Page 20: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Hierarchical Memory Models— many parameters

DiskCPU L1 L2 A

R

M

Increasingaccess timeand space

L3

• Limited success since model to complicated

Cache-Oblivious Algorithms and Data Structures 14

Page 21: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

I/O Model — two parameters

Aggarwal and Vitter 1988

CPU

Memory

I/O

cache

M

B

• Measure number of block transfersbetween two memory levels

• Bottleneck in many computations

• Very successful (simplicity)

Limitations

• Parameters B and M must be known

• Does not handle multiple memory levels

• Does not handle dynamic M

Cache-Oblivious Algorithms and Data Structures 15

Page 22: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

I/O Model — two parameters

Aggarwal and Vitter 1988

CPU

Memory

I/O

cache

M

B

• Measure number of block transfersbetween two memory levels

• Bottleneck in many computations

• Very successful (simplicity)

Limitations

• Parameters B and M must be known

• Does not handle multiple memory levels

• Does not handle dynamic M

Cache-Oblivious Algorithms and Data Structures 15

Page 23: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Ideal Cache Model — no parameters!?Frigo, Leiserson, Prokop, Ramachandran 1999

• Program with only one memory

• Analyze in the I/O model forCPU

Memory

B

M

I/O

cache

• Optimal off-line cache replacementstrategy arbitrary B and M

Advantages

• Optimal on arbitrary level ⇒ optimal on all levels

• Portability, B and M not hard-wired into algorithm

• Dynamic changing parameters

Cache-Oblivious Algorithms and Data Structures 16

Page 24: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Ideal Cache Model — no parameters!?Frigo, Leiserson, Prokop, Ramachandran 1999

• Program with only one memory

• Analyze in the I/O model forCPU

Memory

B

M

I/O

cache

• Optimal off-line cache replacementstrategy arbitrary B and M

Advantages

• Optimal on arbitrary level ⇒ optimal on all levels

• Portability, B and M not hard-wired into algorithm

• Dynamic changing parameters

Cache-Oblivious Algorithms and Data Structures 16

Page 25: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Justification of the Ideal-Cache ModelFrigo, Leiserson, Prokop, Ramachandran 1999

Optimal replacement LRU + 2 × cache size ⇒ at most 2 ×cache misses Sleator and Tarjan, 1985

CorollaryTM,B(N) = O(T2M,B(N)) ⇒ #cache misses using LRU is O(TM,B(N))

Two memory levelsOptimal cache-oblivious algorithm satisfying TM,B(N) = O(T2M,B(N))⇒ optimal #cache misses on each level of a multilevel LRU cache

Fully associativity cacheSimulation of LRU

• Direct mapped cache• Explicit memory management• Dictionary (2-universal hash functions) of cache lines in memory• Expected O(1) access time to a cache line in memory

Cache-Oblivious Algorithms and Data Structures 17

Page 26: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Outline

• Motivation– A typical workstation– A trivial program

• Memory models– I/O model– Ideal cache model

• Basic cache-oblivious algorithms– Matrix multiplication– Search trees– Sorting

• Some experimental results

• Conclusion

Cache-Oblivious Algorithms and Data Structures 18

Page 27: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Warm-up : Scanning

sum = 0

for i = 1 to N do sum = sum + A[i]

O(

N

B

)

I/Os

N

B

A

Corollary Cache-oblivious selection requires O(N/B) I/OsHoare 1961 / Blum et al. 1973

Cache-Oblivious Algorithms and Data Structures 19

Page 28: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Warm-up : Scanning

sum = 0

for i = 1 to N do sum = sum + A[i]

O(

N

B

)

I/Os

N

B

A

Corollary Cache-oblivious selection requires O(N/B) I/OsHoare 1961 / Blum et al. 1973

Cache-Oblivious Algorithms and Data Structures 19

Page 29: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Cache-ObliviousMatrix Multiplication

Cache-Oblivious Algorithms and Data Structures 20

Page 30: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Matrix Multiplication

Problem

Z = X · Y , zij =N∑

k=1

xik · ykj

Layout of matrices

1 2 3 4 5 6 7

9 10 11 12 13 14 15

17 18 19 20 21 22 23

25 26 27 28 29 30 31

33 34 35 36 37 38 39

41 42 43 44 45 46 47

49 50 51 52 53 54 55

57 58 59 60 61 62 6356

48

40

32

24

16

8

0

12

40

36

32

44 45 46 47

41 42 43

37 38 39

33 34 35

9 10 118

4 5 6 7

1 2 30 1917 1816

20 21 22 23

25 26 2724

29 30 312813 14 15

49 50 5148

52 53 54 55

57 58 5956

60 61 62 63

0 1

2 3

4 5

6 7

8 9

10 11

12 13

14 15

16 17

18 19

20 21

22 23

24 25

26 27

28 29

30 31

32 33

34 35

36 37

38 39

40 41

43

44 45

46 47

48 49

50 51

52 53

54 55

56 57

58 59

60 61

6242 63

0

1

2

3

4

5

6

8 16 24 32 40 48

9

10

11

12

13

14

15

17

18

19

20

21

22

23

25

26

27

28

29

30

31

33

34

35

36

37

38

39

41

42

43

44

45

46

47

49

50

51

52

53

54

55

56

57

58

59

60

61

62

637

Row major Column major 4 × 4-blocked Bit interleaved

Cache-Oblivious Algorithms and Data Structures 21

Page 31: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Matrix MultiplicationAlgorithm 1: Nested loops

– Row major

for i = 1 to Nfor j = 1 to N

zij = 0for k = 1 to N

zij = zij + xik · ykj

– Reading a column of Y ⇒ N I/Os– Total O(N 3) I/Os

Algorithm 2: Blocked algorithm (cache-aware)

– Partition X and Y into blocks of size s × s, s

s

1 2 3 4 5 6 7

9 10 11 12 13 14 15

17 18 19 20 21 22 23

25 26 27 28 29 30 31

33 34 35 36 37 38 39

41 42 43 44 45 46 47

49 50 51 52 53 54 55

57 58 59 60 61 62 6356

48

40

32

24

16

8

0

s = Θ(√

M)

– Apply Algorithm 1 to the Ns× N

smatrices where

elements are s × s matrices– s × s-blocked or ( row major and M = Ω(B2) )

O(

(

Ns

)3· s2

B

)

= O(

N3

s·B

)

= O(

N3

B√

M

)

I/Os

– Optimal Hong & Kung, 1981

Cache-Oblivious Algorithms and Data Structures 22

Page 32: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Matrix MultiplicationAlgorithm 1: Nested loops

– Row major

for i = 1 to Nfor j = 1 to N

zij = 0for k = 1 to N

zij = zij + xik · ykj

– Reading a column of Y ⇒ N I/Os– Total O(N 3) I/Os

Algorithm 2: Blocked algorithm (cache-aware)

– Partition X and Y into blocks of size s × s, s

s

1 2 3 4 5 6 7

9 10 11 12 13 14 15

17 18 19 20 21 22 23

25 26 27 28 29 30 31

33 34 35 36 37 38 39

41 42 43 44 45 46 47

49 50 51 52 53 54 55

57 58 59 60 61 62 6356

48

40

32

24

16

8

0

s = Θ(√

M)

– Apply Algorithm 1 to the Ns× N

smatrices where

elements are s × s matrices

– s × s-blocked or ( row major and M = Ω(B2) )

O(

(

Ns

)3· s2

B

)

= O(

N3

s·B

)

= O(

N3

B√

M

)

I/Os

– Optimal Hong & Kung, 1981

Cache-Oblivious Algorithms and Data Structures 22

Page 33: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Matrix MultiplicationAlgorithm 1: Nested loops

– Row major

for i = 1 to Nfor j = 1 to N

zij = 0for k = 1 to N

zij = zij + xik · ykj

– Reading a column of Y ⇒ N I/Os– Total O(N 3) I/Os

Algorithm 2: Blocked algorithm (cache-aware)

– Partition X and Y into blocks of size s × s, s

s

1 2 3 4 5 6 7

9 10 11 12 13 14 15

17 18 19 20 21 22 23

25 26 27 28 29 30 31

33 34 35 36 37 38 39

41 42 43 44 45 46 47

49 50 51 52 53 54 55

57 58 59 60 61 62 6356

48

40

32

24

16

8

0

s = Θ(√

M)

– Apply Algorithm 1 to the Ns× N

smatrices where

elements are s × s matrices– s × s-blocked or ( row major and M = Ω(B2) )

O(

(

Ns

)3· s2

B

)

= O(

N3

s·B

)

= O(

N3

B√

M

)

I/Os

– Optimal Hong & Kung, 1981

Cache-Oblivious Algorithms and Data Structures 22

Page 34: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Matrix MultiplicationAlgorithm 1: Nested loops

– Row major

for i = 1 to Nfor j = 1 to N

zij = 0for k = 1 to N

zij = zij + xik · ykj

– Reading a column of Y ⇒ N I/Os– Total O(N 3) I/Os

Algorithm 2: Blocked algorithm (cache-aware)

– Partition X and Y into blocks of size s × s, s

s

1 2 3 4 5 6 7

9 10 11 12 13 14 15

17 18 19 20 21 22 23

25 26 27 28 29 30 31

33 34 35 36 37 38 39

41 42 43 44 45 46 47

49 50 51 52 53 54 55

57 58 59 60 61 62 6356

48

40

32

24

16

8

0

s = Θ(√

M)

– Apply Algorithm 1 to the Ns× N

smatrices where

elements are s × s matrices– s × s-blocked or ( row major and M = Ω(B2) )

O(

(

Ns

)3· s2

B

)

= O(

N3

s·B

)

= O(

N3

B√

M

)

I/Os

– Optimal Hong & Kung, 1981

Cache-Oblivious Algorithms and Data Structures 22

Page 35: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Matrix MultiplicationAlgorithm 3: Recursive algorithm (cache-oblivious)

X11 X12

X21 X22

Y11 Y12

Y21 Y22

=

X11Y11 + X12Y21 X11Y12 + X12Y22

X21Y11 + X22Y21 X21Y12 + X22Y22

– 8 recursive N2× N

2multiplications + 4 N

2× N

2matrix sums

– # I/Os if bit interleaved or ( row major and M = Ω(B2) )

T (N) ≤

O(N2

B) if N ≤ ε

√M

8 · T(

N2

)

+ O(

N2

B

)

otherwise

T (N) ≤ O

(

N 3

B√

M

)

– Optimal Hong & Kung, 1981

– Non-square matrices Frigo et al., 1999

Cache-Oblivious Algorithms and Data Structures 23

Page 36: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Matrix MultiplicationAlgorithm 3: Recursive algorithm (cache-oblivious)

X11 X12

X21 X22

Y11 Y12

Y21 Y22

=

X11Y11 + X12Y21 X11Y12 + X12Y22

X21Y11 + X22Y21 X21Y12 + X22Y22

– 8 recursive N2× N

2multiplications + 4 N

2× N

2matrix sums

– # I/Os if bit interleaved or ( row major and M = Ω(B2) )

T (N) ≤

O(N2

B) if N ≤ ε

√M

8 · T(

N2

)

+ O(

N2

B

)

otherwise

T (N) ≤ O

(

N 3

B√

M

)

– Optimal Hong & Kung, 1981

– Non-square matrices Frigo et al., 1999

Cache-Oblivious Algorithms and Data Structures 23

Page 37: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Matrix MultiplicationAlgorithm 3: Recursive algorithm (cache-oblivious)

X11 X12

X21 X22

Y11 Y12

Y21 Y22

=

X11Y11 + X12Y21 X11Y12 + X12Y22

X21Y11 + X22Y21 X21Y12 + X22Y22

– 8 recursive N2× N

2multiplications + 4 N

2× N

2matrix sums

– # I/Os if bit interleaved or ( row major and M = Ω(B2) )

T (N) ≤

O(N2

B) if N ≤ ε

√M

8 · T(

N2

)

+ O(

N2

B

)

otherwise

T (N) ≤ O

(

N 3

B√

M

)

– Optimal Hong & Kung, 1981

– Non-square matrices Frigo et al., 1999

Cache-Oblivious Algorithms and Data Structures 23

Page 38: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Cache-Oblivious Matrix Multiplication

I/O bound

O

(

N 3

B√

M

)

Techniques applied

• Recursion / divide-and-conquer

• Recursive memory layout

• Scanning

Cache-Oblivious Algorithms and Data Structures 24

Page 39: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Cache-Oblivious Search Trees

Cache-Oblivious Algorithms and Data Structures 25

Page 40: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Static Search TreesRecursive memory layout ≡ van Emde Boas layout

Prokop 1999

Bk

A

B1

A B1 Bk· · ·

· · ·

h

dh/2e

bh/2c

· · ·

· · ·· · · · · ·

· · ·

· · ·· · ·

· · ·

· · ·

· · ·· · ·

· · · · · ·

Searches use O(logB N) I/Os

Range reportings useO(

logB N + KB

)

I/Os

Best possible (log2 e + o(1)) logB NBender, Brodal, Fagerberg, Ge, He,

Hu, Iacono, López-Ortiz 2003

Cache-Oblivious Algorithms and Data Structures 26

Page 41: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Static Search TreesRecursive memory layout ≡ van Emde Boas layout

Prokop 1999

Bk

A

B1

A B1 Bk· · ·

· · ·

h

dh/2e

bh/2c

· · ·

· · ·· · · · · ·

· · ·

· · ·· · ·

· · ·

· · ·

· · ·· · ·

· · · · · ·

Searches use O(logB N) I/OsRange reportings useO(

logB N + KB

)

I/Os

Best possible (log2 e + o(1)) logB NBender, Brodal, Fagerberg, Ge, He,

Hu, Iacono, López-Ortiz 2003

Cache-Oblivious Algorithms and Data Structures 26

Page 42: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Static Search TreesRecursive memory layout ≡ van Emde Boas layout

Prokop 1999

Bk

A

B1

A B1 Bk· · ·

· · ·

h

dh/2e

bh/2c

· · ·

· · ·· · · · · ·

· · ·

· · ·· · ·

· · ·

· · ·

· · ·· · ·

· · · · · ·

Searches use O(logB N) I/OsRange reportings useO(

logB N + KB

)

I/Os

Best possible (log2 e + o(1)) logB NBender, Brodal, Fagerberg, Ge, He,

Hu, Iacono, López-Ortiz 2003

Cache-Oblivious Algorithms and Data Structures 26

Page 43: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Dynamic Search Trees

• Embed a dynamic tree of small height into a complete tree

• Static van Emde Boas layout

• Rebuild data structure whenever N doubles or halves

6

4

1

3

5

8

7 11

10 13

Search O(logB N)

Range Reporting O(

logB N + KB

)

Updates O(

logB N + log2 NB

)

Brodal, Fagerberg, Jacob 2001Cache-Oblivious Algorithms and Data Structures 27

Page 44: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Example

6

4

1

3

5

8

7 11

10 13

⇓6 4 8 1 − 3 5 − − 7 − − 11 10 13

Cache-Oblivious Algorithms and Data Structures 28

Page 45: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Binary Trees of Small Height

6

4

1

3

5

8

7 11

10 13

2 New

6

3

1

2

4

8

7 11

10 135

• If an insertion causes non-small height then rebuild subtreeat nearest ancestor with sufficient few descendants

• Insertions require amortized time O(log2 N)Andersson and Lai 1990

Cache-Oblivious Algorithms and Data Structures 29

Page 46: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Cache-Oblivious Search Trees

I/O bounds

Search O (logB N)

Range Reporting O(

logB N + KB

)

Updates

O(

logB N + log2 NB

)

O (logB N) (no range reporting)

Techniques applied

• Recursion

• Recursive memory layout

• Scanning (updates, range reporting)

Cache-Oblivious Algorithms and Data Structures 30

Page 47: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Cache-Oblivious Sorting

Cache-Oblivious Algorithms and Data Structures 31

Page 48: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Sorting Problem

3 4 8 2 8 4 0 4 4 6

K A T A J A I N E N

0 2 3 4 4 4 4 6 8 8

A A A E I J K N N T

Cache-Oblivious Algorithms and Data Structures 32

Page 49: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Binary Merge-Sort

2 8 0 4

8 4 4 6402 843

2 830 4 4 4 864

3 4

3 4 8 4 62 8 40 4

8 2 8 4 04 46

Merging

Merging

Merging

Ouput

Input

Merging

• Recursive; two arrays; size O(M) internally in cache

• O(N log N) comparisons • O(

NB

log2NM

)

I/Os

Cache-Oblivious Algorithms and Data Structures 33

Page 50: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Binary Merge-Sort

2 8 0 4

8 4 4 6402 843

2 830 4 4 4 864

3 4

3 4 8 4 62 8 40 4

8 2 8 4 04 46

Merging

Merging

Merging

Ouput

Input

Merging

• Recursive; two arrays; size O(M) internally in cache

• O(N log N) comparisons • O(

NB

log2NM

)

I/Os

Cache-Oblivious Algorithms and Data Structures 33

Page 51: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Merge-Sort

Degree I/O

2 O(

NB

log2NM

)

d O(

NB

logdNM

)

(d ≤ MB− 1)

Θ(

MB

)

O(

NB

logM/BNM

)

= O(SortM,B(N))Aggarwal and Vitter 1988

Funnel-Sort

2 O(1εSortM,B(N))

(M ≥ B1+ε) Frigo, Leiserson, Prokop and Ramachandran 1999Brodal and Fagerberg 2002

Cache-Oblivious Algorithms and Data Structures 34

Page 52: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Lower BoundBrodal and Fagerberg 2003

Block Size Memory I/Os

Machine 1 B1 M t1

Machine 2 B2 M t2

One algorithm, two machines, B1 ≤ B2

Trade-off

8t1B1 + 3t1B1 log8Mt2t1B1

≥ N logN

M− 1.45N

Cache-Oblivious Algorithms and Data Structures 35

Page 53: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Lower Bound

Assumption I/Os

LazyFunnel-sort B ≤ M 1−ε

(a) B2 = M 1−ε : SortB2,M(N)

(b) B1 = 1 : SortB1,M(N) · 1ε

BinaryMerge-sort B ≤ M/2

(a) B2 = M/2 : SortB2,M(N)

(b) B1 = 1 : SortB1,M(N) · log M

1 M1/2

M1−ε

M

B:

Penalty

Theorem This is tight. For any cache-oblivious comparisonbased sorting algorithm: (a) ⇒ (b)

Cache-Oblivious Algorithms and Data Structures 36

Page 54: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Funnel-Sort

Cache-Oblivious Algorithms and Data Structures 37

Page 55: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

k-mergerFrigo et al., FOCS’99

Sorted output stream

M

· · ·

k sorted input streams

=Recursive def.

B1

· · ·

· · ·

· · ·

M1 M√k

M0

B√k← buffers of size k3/2

← k1/2-mergers

· · ·M0 M1B1 B√k

M√k

B2 M2

Recursive Layout

Cache-Oblivious Algorithms and Data Structures 38

Page 56: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

k-mergerFrigo et al., FOCS’99

Sorted output stream

M

· · ·

k sorted input streams

=Recursive def.

B1

· · ·

· · ·

· · ·

M1 M√k

M0

B√k← buffers of size k3/2

← k1/2-mergers

· · ·M0 M1B1 B√k

M√k

B2 M2

Recursive Layout

Cache-Oblivious Algorithms and Data Structures 38

Page 57: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

k-mergerFrigo et al., FOCS’99

Sorted output stream

M

· · ·

k sorted input streams

=Recursive def.

B1

· · ·

· · ·

· · ·

M1 M√k

M0

B√k← buffers of size k3/2

← k1/2-mergers

· · ·M0 M1B1 B√k

M√k

B2 M2

Recursive Layout

Cache-Oblivious Algorithms and Data Structures 38

Page 58: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Lazy k-mergerBrodal and Fagerberg 2002

B1

· · ·

· · ·

· · ·

M1 M√k

M0

B√k →

Procedure Fill(v)while out-buffer not full

if left in-buffer emptyFill(left child)

if right in-buffer emptyFill(right child)

perform one merge step

Lemma

If M ≥ B2 and output buffer has size

k3 then O(k3

B logM (k3) + k) I/Os are

done during an invocation of Fill(root)

Cache-Oblivious Algorithms and Data Structures 39

Page 59: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Lazy k-mergerBrodal and Fagerberg 2002

B1

· · ·

· · ·

· · ·

M1 M√k

M0

B√k →

Procedure Fill(v)while out-buffer not full

if left in-buffer emptyFill(left child)

if right in-buffer emptyFill(right child)

perform one merge step

Lemma

If M ≥ B2 and output buffer has size

k3 then O(k3

B logM (k3) + k) I/Os are

done during an invocation of Fill(root)

Cache-Oblivious Algorithms and Data Structures 39

Page 60: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Lazy k-mergerBrodal and Fagerberg 2002

B1

· · ·

· · ·

· · ·

M1 M√k

M0

B√k →

Procedure Fill(v)while out-buffer not full

if left in-buffer emptyFill(left child)

if right in-buffer emptyFill(right child)

perform one merge step

Lemma

If M ≥ B2 and output buffer has size

k3 then O(k3

B logM (k3) + k) I/Os are

done during an invocation of Fill(root)

Cache-Oblivious Algorithms and Data Structures 39

Page 61: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Funnel-Sort Brodal and Fagerberg 2002Frigo, Leiserson, Prokop and Ramachandran 1999

Divide input in N1/3 segments of size N2/3

Recursively Funnel-Sort each segmentMerge sorted segments by an N 1/3-merger

k

N1/3

N2/9

N4/27

...

2

Theorem Funnel-Sort performs O(SortM,B(N)) I/Os for M ≥ B2

Cache-Oblivious Algorithms and Data Structures 40

Page 62: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Funnel-Sort Brodal and Fagerberg 2002Frigo, Leiserson, Prokop and Ramachandran 1999

Divide input in N1/3 segments of size N2/3

Recursively Funnel-Sort each segmentMerge sorted segments by an N 1/3-merger

k

N1/3

N2/9

N4/27

...

2

Theorem Funnel-Sort performs O(SortM,B(N)) I/Os for M ≥ B2

Cache-Oblivious Algorithms and Data Structures 40

Page 63: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Cache-Oblivious Sorting

I/O bounds

O(

N

BlogM/B

N

B

)

assuming a tall cache M = Ω(

B1+ε)

Techniques applied

• Divide-and-conquer

• Recursive memory layout (k-merger)

• Scanning (buffers)

• Buffers of double-exponential size

Cache-Oblivious Algorithms and Data Structures 41

Page 64: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Outline

• Motivation– A typical workstation– A trivial program

• Memory models– I/O model– Ideal cache model

• Basic cache-oblivious algorithms– Matrix multiplication– Search trees– Sorting

• Some experimental results

• Conclusion

Cache-Oblivious Algorithms and Data Structures 42

Page 65: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Memory Layouts of Trees

DFS

12

3

4 5

6

7 8

9

10

11 12

13

14 15

inorder

84

2

1 3

6

5 7

12

10

9 11

14

13 15

BFS

12

4

8 9

5

10 11

3

6

12 13

7

14 15

van Emde Boas

12

4

5 6

7

8 9

3

10

11 12

13

14 15

(in theory best)

Cache-Oblivious Algorithms and Data Structures 43

Page 66: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Searches in Pointer Based LayoutsBrodal, Fagerberg, Jacob 2002

0.0001

0.001

1000 10000 100000 1e+06

pointer bfspointer dfs

pointer vEBpointer random insertpointer random layout

• van Emde Boas layout wins, followed by the BFS layoutCache-Oblivious Algorithms and Data Structures 44

Page 67: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Searches with Implicit LayoutsBrodal, Fagerberg, Jacob 2002

0.0001

0.001

1000 10000 100000 1e+06

implicit bfsimplicit dfs

implicit vEBimplicit in-order

implicit 9-ary bfs

• BFS layout wins due to simplicity and caching of topmost levels• van Emde Boas layout requires quite complex index computations

Cache-Oblivious Algorithms and Data Structures 45

Page 68: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Searches with Implicit LayoutsBrodal, Fagerberg, Jacob 2002

1e-06

1e-05

0.0001

0.001

0.01

0.1

20 21 22 23 24 25 26 27 28 29

bfsveb

high1024

• van Emde Boas is competitive with a B-tree beyond main memory

Cache-Oblivious Algorithms and Data Structures 46

Page 69: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

8e-09

1e-08

1.2e-08

1.4e-08

1.6e-08

1.8e-08

2e-08

2.2e-08

2.4e-08

2.6e-08

2.8e-08

12 14 16 18 20 22 24 26 28

Wal

ltim

e/n*

log

n

log n

Uniform pairs - Itanium 2

funnelsort2GCC

msort-cmsort-m

Engineering a Cache-Oblivious Sorting Algorithm, Brodal, Fagerberg, Vinther, 2004

Cache-Oblivious Algorithms and Data Structures 47

Page 70: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Outline

• Motivation– A typical workstation– A trivial program

• Memory models– I/O model– Ideal cache model

• Basic cache-oblivious algorithms– Matrix multiplication– Search trees– Sorting

• Some experimental results

• Conclusion

Cache-Oblivious Algorithms and Data Structures 48

Page 71: Cache-Oblivious Algorithms and Data Structuresgerth/slides/swat04invited.pdfCache-Oblivious Algorithms and Data Structures 10 Algorithmic Problem Memory hierarchy has become a fact

Conclusion

• The ideal cache model helps designing competitive androbust algorithms

• Basic techniques: Scanning, recursion/divide-and-conquer,recursive memory layout, sorting

• Many other problems studied: Permuting, FFT,Matrix transposition, Priority queues, Graph algorithms,Computational geometry...

Overhead involved in being cache-oblivious canbe small enough for the nice theoretical proper-ties to transfer into practical advantages

Cache-Oblivious Algorithms and Data Structures 49