
Page 1: Optimizing Graph Algorithms for Improved Cache Performance

Optimizing Graph Algorithms for Improved Cache Performance

Aya Mire & Amir Nahir

Based on: Optimizing Graph Algorithms for Improved Cache Performance – Joon-Sang Park, Michael Penner, Viktor K. Prasanna

Page 2: Optimizing Graph Algorithms for Improved Cache Performance

The Problem with Graphs…

Graph problems pose unique challenges to improving cache performance due to their irregular data access patterns.


Page 3: Optimizing Graph Algorithms for Improved Cache Performance

Agenda

• A recursive implementation of the Floyd-Warshall Algorithm.

• A tiled implementation of the Floyd-Warshall Algorithm.

• Efficient data structures for general graph problems.

• Optimizations for the maximum matching algorithm.

Page 4: Optimizing Graph Algorithms for Improved Cache Performance

Analysis model

• All proofs and complexity analyses will be based on the I/O model, i.e. the goal of the improved algorithm is to minimize the number of cpu-memory transactions.

[Diagram: CPU ↔ Cache ↔ Main Memory, with transaction costs A, B, C satisfying cost(A) ≪ cost(B) and cost(C) ≪ cost(B).]

Page 5: Optimizing Graph Algorithms for Improved Cache Performance

Analysis model

All proofs will assume total control of the cache, i.e. if the cache is big enough to hold two data blocks, then the two can be held in the cache without evicting each other (no conflict misses).

Page 6: Optimizing Graph Algorithms for Improved Cache Performance

The Floyd Warshall Algorithm

• An ‘all-pairs shortest paths’ algorithm.

• Works by iteratively calculating D^(k), where D^(k) is the matrix of all-pairs shortest paths that use only intermediate vertices 1, 2, …, k.

• Each iteration depends on the result of the previous one.

• Time complexity: Θ(|V|³).

Page 7: Optimizing Graph Algorithms for Improved Cache Performance

The Floyd Warshall Algorithm

Pseudo code:

for k from 1 to |V|
  for i from 1 to |V|
    for j from 1 to |V|
      D_{i,j}^{(k)} ← min( D_{i,j}^{(k-1)} , D_{i,k}^{(k-1)} + D_{k,j}^{(k-1)} )
return D^{(|V|)}
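For concreteness, here is a minimal C++ sketch of the triply nested loop above, assuming the distance matrix is stored row-major in a flat array (the names and representation are illustrative, not taken from the paper):

#include <algorithm>
#include <limits>
#include <vector>

// Traditional Floyd-Warshall over a |V| x |V| distance matrix stored
// row-major in a flat vector; dist[i*n + j] holds the current shortest
// distance from i to j (edge weight, 0 on the diagonal, or "infinity").
void floyd_warshall(std::vector<long long>& dist, int n) {
    const long long INF = std::numeric_limits<long long>::max() / 2;
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i) {
            const long long dik = dist[i * n + k];
            if (dik >= INF) continue;              // no path from i through k
            for (int j = 0; j < n; ++j)
                dist[i * n + j] = std::min(dist[i * n + j], dik + dist[k * n + j]);
        }
}

Note how every one of the |V| outer iterations touches the entire matrix; this is exactly the access pattern the next slides set out to improve.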

Page 8: Optimizing Graph Algorithms for Improved Cache Performance

The Floyd Warshall Algorithm

The algorithm accesses the entire matrix in each iteration.

The dependency of the kth iteration on the results of the (k-1)th iteration eliminates the ability to reuse data across iterations.

Page 9: Optimizing Graph Algorithms for Improved Cache Performance

Lemma 1

Suppose D_{i,j}^{(k)} is computed as

  D_{i,j}^{(k)} ← min( D_{i,j}^{(k-1)} , D_{i,k}^{(k')} + D_{k,j}^{(k'')} )   for k-1 ≤ k', k'' ≤ |V|,

then upon termination the FW algorithm correctly computes the all-pairs shortest paths.

(The traditional algorithm uses D_{i,j}^{(k)} ← min( D_{i,j}^{(k-1)} , D_{i,k}^{(k-1)} + D_{k,j}^{(k-1)} ).)

Page 10: Optimizing Graph Algorithms for Improved Cache Performance

Lemma 1 - Proof

To distinguish from the traditional FW algorithm, we use T_{i,j}^{(k)} to denote the results calculated using the “new” computation:

  T_{i,j}^{(k)} ← min( T_{i,j}^{(k-1)} , T_{i,k}^{(k')} + T_{k,j}^{(k'')} )   for k-1 ≤ k', k'' ≤ |V|.

(Lemma 1: Suppose D_{i,j}^{(k)} is computed as D_{i,j}^{(k)} ← min( D_{i,j}^{(k-1)} , D_{i,k}^{(k')} + D_{k,j}^{(k'')} ) for k-1 ≤ k', k'' ≤ |V|; then upon termination the FW algorithm correctly computes the all-pairs shortest paths.)

Page 11: Optimizing Graph Algorithms for Improved Cache Performance

Lemma 1 - Proof

First, we show that for 1 ≤ k ≤ |V| the following inequality holds:

  T_{i,j}^{(k)} ≤ D_{i,j}^{(k)}

We prove this by induction.

Base case: by definition, T_{i,j}^{(0)} = D_{i,j}^{(0)}.

Page 12: Optimizing Graph Algorithms for Improved Cache Performance

Lemma 1 - Proof

Induction step: suppose T_{i,j}^{(k)} ≤ D_{i,j}^{(k)} for k = m-1. Then:

  T_{i,j}^{(m)} ← min( T_{i,j}^{(m-1)} , T_{i,m}^{(m')} + T_{m,j}^{(m'')} )
              ≤ min( D_{i,j}^{(m-1)} , T_{i,m}^{(m')} + T_{m,j}^{(m'')} )      (by the induction hypothesis)
              ≤ min( D_{i,j}^{(m-1)} , T_{i,m}^{(m-1)} + T_{m,j}^{(m-1)} )     (limiting the choices of intermediate vertices makes a path the same or longer)
              ≤ min( D_{i,j}^{(m-1)} , D_{i,m}^{(m-1)} + D_{m,j}^{(m-1)} )     (by the induction hypothesis)
              = D_{i,j}^{(m)}                                                  (by definition)

Hence, for 1 ≤ k ≤ |V|:  T_{i,j}^{(k)} ≤ D_{i,j}^{(k)}.

Page 13: Optimizing Graph Algorithms for Improved Cache Performance

Lemma 1 - Proof

On the other hand, since the traditional algorithm computes the shortest paths at termination, and since T_{i,j}^{(|V|)} is the length of some path, we have:

  D_{i,j}^{(|V|)} ≤ T_{i,j}^{(|V|)}

Together with T_{i,j}^{(k)} ≤ D_{i,j}^{(k)} for 1 ≤ k ≤ |V|, this gives

  D_{i,j}^{(|V|)} = T_{i,j}^{(|V|)}

which proves Lemma 1.

Page 14: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Recursive Implementation

We first consider the basic case of a two-node graph.

[Figure: a two-vertex graph with an edge of weight w1 from vertex 1 to vertex 2 and an edge of weight w2 from vertex 2 to vertex 1, together with its 2×2 weight matrix.]

Floyd-Warshall(T):
  T11 = min(T11, T11 + T11)
  T12 = min(T12, T11 + T12)
  T21 = min(T21, T21 + T11)
  T22 = min(T22, T21 + T12)
  T22 = min(T22, T22 + T22)
  T21 = min(T21, T22 + T21)
  T12 = min(T12, T12 + T22)
  T11 = min(T11, T12 + T21)

Page 15: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Recursive Implementation

The general case

The matrix is divided into four quadrants: I (top-left), II (top-right), III (bottom-left) and IV (bottom-right).

Floyd-Warshall(T):
  If (not base case)
    TI   = min(TI , TI , TI)
    TII  = min(TII , TI , TII)
    TIII = min(TIII , TIII , TI)
    TIV  = min(TIV , TIII , TII)
    TIV  = min(TIV , TIV , TIV)
    TIII = min(TIII , TIV , TIII)
    TII  = min(TII , TII , TIV)
    TI   = min(TI , TII , TIII)
  else
    ...

Here "TX = min(TX, TY, TZ)" denotes a recursive call that updates quadrant X using rows taken from quadrant Y and columns taken from quadrant Z.
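A minimal C++ sketch of this recursion, assuming for simplicity that |V| is a power of two and that the matrix is stored row-major (the paper instead stores it in the Z-Morton layout, which is what makes each quadrant contiguous in memory); all names are illustrative:

#include <algorithm>
#include <vector>

// fwr updates A[i][j] = min(A[i][j], B[i][k] + C[k][j]) for all i, j, k inside
// three size x size quadrants given by their top-left corners (assumes finite
// weights or a saturating "infinity").
struct Mat {
    std::vector<long long>& d;
    int n;                                   // full matrix dimension
    long long& at(int i, int j) { return d[i * n + j]; }
};

void fwr(Mat m, int ai, int aj, int bi, int bj, int ci, int cj, int size) {
    if (size == 1) {                         // base case: single entries
        m.at(ai, aj) = std::min(m.at(ai, aj), m.at(bi, bj) + m.at(ci, cj));
        return;
    }
    int h = size / 2;
    // Quadrant offsets: I = (0,0), II = (0,h), III = (h,0), IV = (h,h).
    // The eight calls mirror the slide: T_X = min(T_X, T_Y, T_Z).
    fwr(m, ai,     aj,     bi,     bj,     ci,     cj,     h); // I   <- I,   I
    fwr(m, ai,     aj + h, bi,     bj,     ci,     cj + h, h); // II  <- I,   II
    fwr(m, ai + h, aj,     bi + h, bj,     ci,     cj,     h); // III <- III, I
    fwr(m, ai + h, aj + h, bi + h, bj,     ci,     cj + h, h); // IV  <- III, II
    fwr(m, ai + h, aj + h, bi + h, bj + h, ci + h, cj + h, h); // IV  <- IV,  IV
    fwr(m, ai + h, aj,     bi + h, bj + h, ci + h, cj,     h); // III <- IV,  III
    fwr(m, ai,     aj + h, bi,     bj + h, ci + h, cj + h, h); // II  <- II,  IV
    fwr(m, ai,     aj,     bi,     bj + h, ci + h, cj,     h); // I   <- II,  III
}

// Top-level call: all three operands are the whole matrix.
void floyd_warshall_recursive(std::vector<long long>& dist, int n) {
    Mat m{dist, n};
    fwr(m, 0, 0, 0, 0, 0, 0, n);
}

The eight calls correspond one-to-one to the eight lines of the slide; the top-level call passes the whole matrix as all three operands.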

Page 16: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Recursive Algorithm – Correctness

It can be shown that for each action

  D_{i,j}^{(k)} ← min( D_{i,j}^{(k-1)} , D_{i,k}^{(k-1)} + D_{k,j}^{(k-1)} )

in FW’s traditional implementation, there is a corresponding action

  T_{i,j}^{(k)} ← min( T_{i,j}^{(k-1)} , T_{i,k}^{(k')} + T_{k,j}^{(k'')} ),   where k-1 ≤ k', k'' ≤ |V|.

Hence the algorithm’s correctness follows from Lemma 1.

Page 17: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Recursive Algorithm – How does it actually work…

[Diagram: the matrix T^(0) is split into quadrants TI^(0), TII^(0), TIII^(0), TIV^(0); the quadrants pass through intermediate states such as TI^(|V|/2) and TII^(|V|/2) during the recursion, and on return the full matrix T^(|V|) holds the final quadrants TI^(|V|), TII^(|V|), TIII^(|V|), TIV^(|V|). The recursive pseudocode from Page 15 is shown alongside.]

Page 18: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Recursive Algorithm - Example

[Figure: an example directed graph on 8 vertices with edge weights, together with its 8×8 weight matrix.]

Page 19: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Recursive Algorithm – Example

[Figure: the example matrix after the first base-case updates of the recursion; new shortest paths such as 1-3-4 and 7-6-8 are recorded. The base-case update sequence from Page 14 is shown alongside.]

Page 20: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Recursive Algorithm – Example

[Figure: further recursive updates of the example matrix; new shortest paths such as 7-2-4, 7-6-5, 7-2-8, 7-6-4, 1-6-8, 1-6-5, 2-6-5, 1-6-4 and 2-6-4 are recorded. The base-case update sequence from Page 14 is shown alongside.]

Page 21: Optimizing Graph Algorithms for Improved Cache Performance

Representing the Matrix in an efficient way

We usually store matrices in the memory in one of two ways:

Using either of these layouts will not improve performance, since the algorithm accesses the matrix by quadrants, which are contiguous in neither layout.

For example, a 4×4 matrix whose entries are numbered

   0  1  2  3
   4  5  6  7
   8  9 10 11
  12 13 14 15

is stored as:

Row-major layout:    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Column-major layout: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15

Page 22: Optimizing Graph Algorithms for Improved Cache Performance

Representing the Matrix in an efficient way

The Z-Morton layout: perform the following operations recursively until the quadrant size is a single data unit:
  - divide the matrix into four quadrants;
  - store quadrants I, II, III, IV one after the other in memory.

For example, the same 4×4 matrix is stored as:

  0 1 4 5 2 3 6 7 8 9 12 13 10 11 14 15
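A small C++ sketch of the index computation behind this layout (assumption: the matrix dimension is a power of two; the bit interleaving below is the standard Morton/Z-order encoding):

#include <cstdint>

// Morton (Z-order) index of element (row, col) in an n x n matrix, n a power
// of two: interleave the bits of row and col, with the row bit more
// significant at every level. Storing element (row, col) at data[morton(row,
// col)] yields exactly the recursive quadrant order I, II, III, IV above.
std::uint64_t morton(std::uint32_t row, std::uint32_t col) {
    std::uint64_t idx = 0;
    for (int b = 0; b < 32; ++b) {
        idx |= static_cast<std::uint64_t>((col >> b) & 1u) << (2 * b);
        idx |= static_cast<std::uint64_t>((row >> b) & 1u) << (2 * b + 1);
    }
    return idx;
}
// Example: in the 4x4 case, (0,2) maps to index 4 and (1,1) to index 3,
// reproducing the layout 0 1 4 5 | 2 3 6 7 | 8 9 12 13 | 10 11 14 15.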

Page 23: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis

The running time of the algorithm is given by T(|V|) = 8·T(|V|/2) + O(1) = Θ(|V|³).

Without considering the cache, the number of cpu-memory transactions is exactly the same as the running time.

(The recursive pseudocode from Page 15 is shown alongside.)

Page 24: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis - Theorem

There exists some B, where B = O(|cache|^(1/2)), such that when using the recursive FW implementation, with the matrix stored in the Z-Morton layout, the number of cpu-memory transactions is reduced by a factor of B.

⇒ there will be O(|V|³/B) cpu-memory transactions.

Page 25: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis

After k recursive calls, a quadrant’s dimension is |V|/2^k.

There exists some k such that B ≜ |V|/2^k and 3·B² ≤ |cache|.

Once this condition is fulfilled, 3 matrices of size B² can be placed in the cache, and no further cpu-memory transactions are required to finish that subproblem.

⇒ B = O(|cache|^(1/2))

(The recursive pseudocode from Page 15 is shown alongside.)

Page 26: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis

Therefore we get:

  O((|V|/B)³) · O(B²) = O(|V|³/B)

where O((|V|/B)³) is the transaction complexity of FW on a matrix of dimension |V|/B with no cache (each “element” now being a whole quadrant), and O(B²) is the number of transactions required to bring a B×B quadrant into the cache.

⇒ the number of cpu-memory transactions is reduced by a factor of B.
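Spelled out as a worked equation (my arithmetic, following the slide’s labels):

\[
\underbrace{O\!\left(\left(\tfrac{|V|}{B}\right)^{3}\right)}_{\text{block-level FW steps}}
\cdot
\underbrace{O\!\left(B^{2}\right)}_{\text{cost of loading one } B \times B \text{ quadrant}}
= O\!\left(\frac{|V|^{3}}{B^{3}} \cdot B^{2}\right)
= O\!\left(\frac{|V|^{3}}{B}\right)
\]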

Page 27: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis – lower bound

In “I/O Complexity: The Red-Blue Pebble Game”, J. Hong and H. Kung showed that the lower bound on cpu-memory transactions for multiplying matrices is Ω(N³/B), where B = O(|cache|^(1/2)).

Page 28: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis – lower bound – Theorem

The lower bound on cpu-memory transactions for the Floyd-Warshall algorithm is Ω(|V|³/B), where B = O(|cache|^(1/2)).

Proof: by reduction from matrix multiplication.

Page 29: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis – lower bound theorem - Proof

Matrix multiplication of N×N matrices,

  for k from 1 to N
    for i from 1 to N
      for j from 1 to N
        C_{i,j} += A_{i,k} · B_{k,j}

has exactly the same loop structure and data access pattern as the FW update

  D_{i,j}^{(k)} ← min( D_{i,j}^{(k-1)} , D_{i,k}^{(k-1)} + D_{k,j}^{(k-1)} )

over a |V| × |V| matrix, so an FW implementation that beat the bound could be used to multiply matrices with fewer transactions than the Hong-Kung lower bound allows.
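To make the correspondence concrete, a side-by-side C++ sketch (illustrative; the substitution is + → min and · → + on the same index pattern):

#include <algorithm>
#include <vector>

// Ordinary matrix multiplication: C[i][j] += A[i][k] * B[k][j].
void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// The Floyd-Warshall update: D[i][j] = min(D[i][j], D[i][k] + D[k][j]).
// Same loops, same access pattern, with (+, *) replaced by (min, +).
void fw_update(std::vector<long long>& D, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                D[i * n + j] = std::min(D[i * n + j], D[i * n + k] + D[k * n + j]);
}

Any schedule of memory transfers that works for one kernel works for the other, which is how FW inherits the Ω(N³/B) bound.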

Page 30: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis - Conclusion

The algorithm’s complexity: O(|V|³/B). Lower bound for FW: Ω(|V|³/B).

⇒ The recursive implementation is asymptotically optimal among all implementations of the Floyd-Warshall algorithm (with respect to cpu-memory transactions).

Page 31: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Recursive Implementation - Comments

Note that the size of the cache is not one of the algorithm’s parameters, nor is it needed in order to store the matrix in the Z-Morton layout.

Therefore: the algorithm is cache-oblivious.

Page 32: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Recursive Implementation - Comments

Though the analysis model included only a single level of cache, since no special attributes of that cache were assumed, the proofs generalize to multiple levels of cache.

[Diagram: memory hierarchy with L0, L1 and L2 caches between the CPU and main memory.]

Page 33: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Recursive Implementation - Comments

Since cache parameters have been disregarded, the best (and simplest) way to find the optimal size B is by experiment.

Page 34: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Recursive Implementation - Improvement

The algorithm can be further improved by making it cache-conscious: performing the recursive calls only until the problem size is reduced to B, and solving the B-size problem in the traditional way (this saves the overhead of further recursive calls).

This modification showed up to a 2x improvement in running time on some of the machines.

Page 35: Optimizing Graph Algorithms for Improved Cache Performance

Page 36: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Implementation

Consider a special case of Lemma 1 in which k', k'' are restricted such that

  k - 1 ≤ k', k'' ≤ k - 1 + B,   where 3·B² ≤ |cache|  (B = O(|cache|^(1/2))).

(Lemma 1: Suppose D_{i,j}^{(k)} is computed as D_{i,j}^{(k)} ← min( D_{i,j}^{(k-1)} , D_{i,k}^{(k')} + D_{k,j}^{(k'')} ) for k-1 ≤ k', k'' ≤ |V|; then upon termination the FW algorithm correctly computes the all-pairs shortest paths.)

This leads to the following tiled implementation of FW’s algorithm.

Page 37: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Implementation

Divide the matrix into B×B tiles.
Perform |V|/B iterations; during the tth iteration:
  I. update the (t,t)th tile
  II. update the remainder of the tth tile-row and tth tile-column
  III. update the rest of the matrix

Page 38: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Implementation

Each iteration consists of three phases.

Phase I: perform FW’s algorithm on the (t,t)th tile (which is self-dependent).

(The tiling steps from Page 37 are shown alongside.)

Page 39: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Implementation

Phase II: updating the remainder of row t:

  A_{i,j}^{(k)} ← min( A_{i,j}^{(k-1)} , A_{i,k}^{(tB)} + A_{k,j}^{(k-1)} )

and updating the remainder of column t:

  A_{i,j}^{(k)} ← min( A_{i,j}^{(k-1)} , A_{i,k}^{(k-1)} + A_{k,j}^{(tB)} )

(The tiling steps from Page 37 are shown alongside.)

During the tth iteration, k goes from (t-1)·B + 1 to t·B.

Page 40: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Implementation

Phase III: updating the rest of the matrix:

  A_{i,j}^{(k)} ← min( A_{i,j}^{(k-1)} , A_{i,k}^{(tB)} + A_{k,j}^{(tB)} )

(The tiling steps from Page 37 are shown alongside.)

During the tth iteration, k goes from (t-1)·B + 1 to t·B.
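Putting the three phases together, a minimal C++ sketch (assuming |V| is a multiple of B and a plain row-major matrix; the paper additionally stores each tile contiguously via the Z-Morton layout; names are illustrative):

#include <algorithm>
#include <vector>

// tile_update: for every k handled in iteration t, update tile (r, c) using
// the already-updated tiles (r, t) and (t, c):
//   dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j]).
static void tile_update(std::vector<long long>& d, int n,
                        int r, int c, int t, int B) {
    for (int k = t * B; k < (t + 1) * B; ++k)
        for (int i = r * B; i < (r + 1) * B; ++i)
            for (int j = c * B; j < (c + 1) * B; ++j)
                d[i * n + j] = std::min(d[i * n + j], d[i * n + k] + d[k * n + j]);
}

void floyd_warshall_tiled(std::vector<long long>& d, int n, int B) {
    int T = n / B;                              // tiles per dimension
    for (int t = 0; t < T; ++t) {
        tile_update(d, n, t, t, t, B);          // phase I: diagonal tile
        for (int x = 0; x < T; ++x) {           // phase II: row t and column t
            if (x == t) continue;
            tile_update(d, n, t, x, t, B);      // tile (t, x) in row t
            tile_update(d, n, x, t, t, B);      // tile (x, t) in column t
        }
        for (int r = 0; r < T; ++r)             // phase III: everything else
            for (int c = 0; c < T; ++c)
                if (r != t && c != t)
                    tile_update(d, n, r, c, t, B);
    }
}

Within one call to tile_update only the three tiles (r,c), (r,t) and (t,c) are touched, which is exactly the 3·B² ≤ |cache| working set used in the analysis.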

Page 41: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Example

[Figure: the same 8-vertex example graph and its 8×8 weight matrix as on Page 18.]

Page 42: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Example

[Figure: the first step of the tiled example; the new shortest path 7-2-8 is recorded. The tiling steps are shown alongside.]

Page 43: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Example

[Figure: the next step of the tiled example; the new shortest path 1-3-4 is recorded. The tiling steps are shown alongside.]

Page 44: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Example

[Figure: updating the rest of the matrix in the tiled example; new shortest paths such as 1-6-5, 2-6-5, 7-6-5, 1-6-4, 2-6-4, 1-6-8 and 7-6-4 are recorded. The tiling steps are shown alongside.]

Page 45: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Example

[Figure: the example matrix at the end of the iteration of the tiled algorithm. The tiling steps are shown alongside.]

Page 46: Optimizing Graph Algorithms for Improved Cache Performance

Representing the Matrix in an efficient way

In order to match the data access pattern, each tile must be stored in contiguous memory.

Therefore, the Z-Morton layout is used (see the 4×4 example on Page 22: 0 1 4 5 2 3 6 7 8 9 12 13 10 11 14 15).

Page 47: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Tiled Algorithm –correctness

Let D_{i,j}^{(k)} be the result of the kth iteration of the traditional FW implementation.

Even though D_{i,j}^{(k)} and A_{i,j}^{(k)} may not be equal during the “inner” iterations, it can be shown by induction that at the end of each iteration D_{i,j}^{(k)} = A_{i,j}^{(k)} (where k = t·B).

Page 48: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis - Theorem

There exists some B, where B = O(|cache|^(1/2)), such that when using the tiled FW implementation, the number of cpu-memory transactions is reduced by a factor of B.

⇒ there will be O(|V|³/B) cpu-memory transactions.

Page 49: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis

There are (|V|/B) × (|V|/B) tiles in the matrix.

There are |V|/B iterations in the algorithm; in each iteration, all tiles are accessed.

Updating a tile requires holding at most 3 tiles in the cache. ⇒ 3·B² ≤ |cache|

(The tiling steps from Page 37 are shown alongside.)

Page 50: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis

Therefore we get:

  (|V|/B) · [(|V|/B) × (|V|/B)] · O(B²) = O(|V|³/B)

where |V|/B is the number of iterations, (|V|/B) × (|V|/B) is the number of tiles in the matrix, and O(B²) is the number of transactions required to bring a B×B tile into the cache.

⇒ the number of cpu-memory transactions is reduced by a factor of B.

Page 51: Optimizing Graph Algorithms for Improved Cache Performance

Complexity Analysis - Conclusion

The algorithm’s complexity: O(|V|³/B). Lower bound for FW: Ω(|V|³/B).

⇒ The tiled implementation is asymptotically optimal among all implementations of the Floyd-Warshall algorithm (with respect to cpu-memory transactions).

Page 52: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Implementation - Comments

Note that when using the tiling method, the size of the cache is one of the algorithm’s parameters.

Therefore: the tiled algorithm is cache-conscious.

Page 53: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – Tiled Implementation - Comments

In practice, the best (and simplest) way to find the optimal tile size B is by experiment.

Page 54: Optimizing Graph Algorithms for Improved Cache Performance

FW’s Algorithm – experimental results

Both algorithms (recursive and tiled) showed a 30% improvement in L1 cache misses and a 50% improvement in L2 cache misses for problem sizes of 1024 and 2048 vertices.

The results for both algorithms are nearly identical! (less than 1% difference)

Page 55: Optimizing Graph Algorithms for Improved Cache Performance

Dijkstra’s Algorithm for Single Source Shortest Paths & Prim’s Algorithm for Minimum Spanning Tree

Dijkstra’s Algorithm:
  S ← ∅
  Q ← V
  while Q ≠ ∅
    u ← extract-min(Q)
    S ← S ∪ {u}
    for each v ∊ adj(u)
      update d[v]
  return S

Prim’s Algorithm:
  Q ← V
  for each u ∊ Q do key(u) ← ∞
  key(root) ← 0
  while Q ≠ ∅
    u ← extract-min(Q)
    for each v ∊ adj(u)
      if v ∊ Q and weight(u,v) < key(v) then key(v) ← weight(u,v)

Both algorithms have the same data access pattern.

Page 56: Optimizing Graph Algorithms for Improved Cache Performance

Graph representation

There are two commonly used graph representations.

The adjacency matrix: A(i,j) = the cost of the edge from vertex i to vertex j.

Elements are accessed contiguously.

Representation size of O(|V|²).

Page 57: Optimizing Graph Algorithms for Improved Cache Performance

Graph representation

The adjacency list representation: a pointer-based representation in which a list of adjacent vertices is stored for each vertex in the graph; each node in the list holds the cost of the edge from the given vertex to the adjacent vertex.

Representation size of O(|V| + |E|).

The pointer-based representation leads to cache pollution.

Page 58: Optimizing Graph Algorithms for Improved Cache Performance

Adjacency Array representation

For each vertex in the graph, there is an array of its adjacent vertices.

Representation size of O(|V| + |E|).

Elements are accessed contiguously.

[Diagram: an index 1, 2, 3, …, |V| pointing into per-vertex arrays of (vertex, weight) pairs: vi wi vj wj ….]
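A minimal C++ sketch of such an adjacency array (a CSR-style layout; the struct and builder below are illustrative, not taken from the paper):

#include <array>
#include <vector>

// Adjacency-array representation: all neighbours (and edge weights) of a
// vertex sit in one contiguous range, so scanning adj(u) touches consecutive
// memory instead of chasing list pointers.
struct AdjacencyArray {
    std::vector<int> offset;   // offset[u] .. offset[u+1] index u's neighbours
    std::vector<int> head;     // head[e]   = neighbour vertex of edge e
    std::vector<int> weight;   // weight[e] = cost of edge e
};

// Build from a directed edge list {u, v, w}.
AdjacencyArray build(int n, const std::vector<std::array<int, 3>>& edges) {
    AdjacencyArray g;
    g.offset.assign(n + 1, 0);
    for (const auto& e : edges) ++g.offset[e[0] + 1];            // out-degree counts
    for (int u = 0; u < n; ++u) g.offset[u + 1] += g.offset[u];  // prefix sums
    g.head.resize(edges.size());
    g.weight.resize(edges.size());
    std::vector<int> pos(g.offset.begin(), g.offset.end() - 1);  // next free slot per vertex
    for (const auto& e : edges) {
        int p = pos[e[0]]++;
        g.head[p] = e[1];
        g.weight[p] = e[2];
    }
    return g;
}

Scanning the neighbours of u is then a simple loop over head[offset[u]] .. head[offset[u+1]-1], which is exactly the access pattern Dijkstra, Prim and BFS need.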

Page 59: Optimizing Graph Algorithms for Improved Cache Performance

Matching Algorithm for Bipartite Graph

Matching: a set M of edges in a graph is a matching if no vertex of the graph is an endpoint of more than one edge in M.

A matching is maximum if no other matching contains more edges.

[Figure: a small bipartite graph on vertices 1, 2, 3, 4.]

{1 – 4} is a maximal matching.

{1 – 3, 2 – 4} is a maximum matching.

Page 60: Optimizing Graph Algorithms for Improved Cache Performance

Matching Algorithm for Bipartite Graph

Let M be a matching. All edges in the graph are divided into two groups: matching edges and non-matching edges.

A vertex is called free if it is not an endpoint of any matching edge.

Page 61: Optimizing Graph Algorithms for Improved Cache Performance

Matching Algorithm for Bipartite Graph

A path P = u0, e1, u1, …, ek, uk is called an augmenting path (with respect to M) if:
  - u0 and uk are free;
  - the even-numbered edges e2, e4, …, e(k-1) are matching edges.

The set of edges M \ {e2, e4, …, e(k-1)} ∪ {e1, e3, …, ek} is also a matching, and it has one edge more than M has.

So, if we find an augmenting path, we can construct a larger matching.

Page 62: Optimizing Graph Algorithms for Improved Cache Performance

Finding Augmenting paths in a Bipartite Graph

In bipartite graphs, each augmenting path has one end in A and one end in B. Following such an augmenting path starting from its end in A, we traverse non-matching edges from A to B and matching edges from B to A.

By turning the graph into a directed graph (all matching edges are directed vB → vA, all the rest vA → vB), we turn the problem into a simple path-finding problem in a directed graph.

Page 63: Optimizing Graph Algorithms for Improved Cache Performance

Matching Algorithm for Bipartite Graph

The Algorithm:
  while (there exists an augmenting path)
    increase |M| by one using the augmenting path
  return M

Algorithm’s complexity: O(|V|·|E|)
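A compact C++ sketch of this loop (the slides find the augmenting paths with BFS over the directed version of the graph; the sketch below uses the equivalent DFS formulation, and all names are illustrative):

#include <vector>

// Basic augmenting-path matcher (Kuhn's algorithm). adjA[u] lists the
// B-vertices adjacent to A-vertex u; matchA/matchB hold each vertex's current
// partner, or -1 if the vertex is free.
bool augment(int u, const std::vector<std::vector<int>>& adjA,
             std::vector<int>& matchA, std::vector<int>& matchB,
             std::vector<bool>& visited) {
    for (int v : adjA[u]) {
        if (visited[v]) continue;
        visited[v] = true;
        // v is free, or v's current partner can itself be re-matched elsewhere.
        if (matchB[v] == -1 || augment(matchB[v], adjA, matchA, matchB, visited)) {
            matchA[u] = v;
            matchB[v] = u;
            return true;
        }
    }
    return false;
}

// Augment from every free A-vertex. Starting from any initial matching held
// in matchA/matchB (possibly empty), this ends with a maximum matching.
int maximum_matching(const std::vector<std::vector<int>>& adjA, int nB,
                     std::vector<int>& matchA, std::vector<int>& matchB) {
    int size = 0;
    for (int u = 0; u < (int)adjA.size(); ++u) {
        if (matchA[u] != -1) { ++size; continue; }      // already matched
        std::vector<bool> visited(nB, false);
        if (augment(u, adjA, matchA, matchB, visited)) ++size;
    }
    return size;
}

Each successful call to augment() flips one augmenting path, so the matching grows by exactly one edge per iteration, giving the O(|V|·|E|) bound.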

Page 64: Optimizing Graph Algorithms for Improved Cache Performance

Matching Algorithm for Bipartite Graph – first optimization

In order to find augmenting paths, we use the BFS algorithm, which has a data access pattern similar to that of Dijkstra/Prim.

Therefore, using the adjacency array instead of the adjacency list / matrix improves running time.

Page 65: Optimizing Graph Algorithms for Improved Cache Performance

Matching Algorithm for Bipartite Graph – second optimization

We try to reduce the size of the working set, as in tiling:
  I. Partition G into g[1], g[2], …, g[p].
  II. Find the maximum matching in g[i] for each i ∊ {1, 2, …, p} using the basic algorithm.
  III. Unite all sub-matchings into M.
  IV. Find the maximum matching in G using the basic algorithm (starting from M).
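A minimal C++ sketch of steps I-IV, reusing the maximum_matching() routine from the previous sketch (declared again here so the fragment stands alone; partA/partB, which assign each vertex to a partition, are illustrative):

#include <vector>

// maximum_matching() is the routine from the previous sketch.
int maximum_matching(const std::vector<std::vector<int>>& adjA, int nB,
                     std::vector<int>& matchA, std::vector<int>& matchB);

// Steps I-IV: match inside each partition first (every subproblem fits in the
// cache), then finish on the whole graph starting from the united matching.
int partitioned_matching(const std::vector<std::vector<int>>& adjA, int nB,
                         const std::vector<int>& partA,   // partition of A-vertices
                         const std::vector<int>& partB,   // partition of B-vertices
                         std::vector<int>& matchA, std::vector<int>& matchB) {
    matchA.assign(adjA.size(), -1);
    matchB.assign(nB, -1);
    // Steps I-II: keep only the edges internal to a partition and match there.
    // One call handles all partitions at once, since only intra-partition
    // edges remain in the filtered adjacency.
    std::vector<std::vector<int>> intra(adjA.size());
    for (int u = 0; u < (int)adjA.size(); ++u)
        for (int v : adjA[u])
            if (partA[u] == partB[v]) intra[u].push_back(v);
    maximum_matching(intra, nB, matchA, matchB);
    // Step III: matchA/matchB already hold the union of the sub-matchings.
    // Step IV: run the basic algorithm on the whole graph, starting from M.
    return maximum_matching(adjA, nB, matchA, matchB);
}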

Page 66: Optimizing Graph Algorithms for Improved Cache Performance

Matching Algorithm for Bipartite Graph – second optimization

If the sizes of the sub-graphs are chosen appropriately, so that each of them fits into the cache, phase II generates only O(|V| + |E|) cpu-memory transactions, because each data element needs to be loaded into the cache only once.

The best size for a sub-graph is found by experiment.

Page 67: Optimizing Graph Algorithms for Improved Cache Performance

Matching Algorithm for Bipartite Graph – best case

In the best case, the maximum matching is already found in phase II, and the algorithm’s cpu-memory transaction complexity is O(|V| + |E|).

That leaves us with the problem of partitioning the graph well.

(Steps I-IV from Page 65 are shown alongside.)

Page 68: Optimizing Graph Algorithms for Improved Cache Performance

Partitioning the Bipartite Graph

The goal: to partition the edges into two groups such that the best matching possible is found within each group.

Algorithm:
  I. Arbitrarily partition the vertices into 4 equal partitions.
  II. Count the number of edges between each pair of partitions.
  III. Combine the partitions into two partitions such that as many “internal” edges as possible are created.
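A small C++ sketch of this heuristic (assumptions: vertices are assigned to the four groups round-robin, and only balanced 2+2 combinations of the four groups are considered; the edge-list representation and all names are illustrative):

#include <utility>
#include <vector>

// Partition the vertices of a bipartite graph into two groups so that as many
// edges as possible have both endpoints in the same group.
// edges holds pairs (a, b): an edge between A-vertex a and B-vertex b.
// Returns the partition index (0 or 1) of every A-vertex, then every B-vertex.
std::vector<int> partition_vertices(int nA, int nB,
                                    const std::vector<std::pair<int, int>>& edges) {
    // Step I: arbitrarily assign every vertex to one of four groups.
    std::vector<int> groupA(nA), groupB(nB);
    for (int u = 0; u < nA; ++u) groupA[u] = u % 4;
    for (int v = 0; v < nB; ++v) groupB[v] = v % 4;

    // Step II: count the edges between each pair of groups.
    int cnt[4][4] = {};
    for (const auto& e : edges) ++cnt[groupA[e.first]][groupB[e.second]];

    // Step III: three balanced ways to merge four groups into two partitions;
    // pick the one that keeps the most edges internal.
    const int pairings[3][4] = {{0, 0, 1, 1}, {0, 1, 0, 1}, {0, 1, 1, 0}};
    int best = 0, bestInternal = -1;
    for (int p = 0; p < 3; ++p) {
        int internal = 0;
        for (int ga = 0; ga < 4; ++ga)
            for (int gb = 0; gb < 4; ++gb)
                if (pairings[p][ga] == pairings[p][gb]) internal += cnt[ga][gb];
        if (internal > bestInternal) { bestInternal = internal; best = p; }
    }

    std::vector<int> part(nA + nB);
    for (int u = 0; u < nA; ++u) part[u]      = pairings[best][groupA[u]];
    for (int v = 0; v < nB; ++v) part[nA + v] = pairings[best][groupB[v]];
    return part;
}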

Page 69: Optimizing Graph Algorithms for Improved Cache Performance

Conclusions

Using efficient data representation methods can significantly improve an algorithm’s running time.

Further improvement can be achieved by methods such as tiling and recursion.

Other graph algorithms, such as Bellman-Ford, BFS and DFS, can be improved by the same techniques because of their similar data access patterns.

Page 70: Optimizing Graph Algorithms for Improved Cache Performance