A Data Locality Optimizing Algorithm
based on "A Data Locality Optimizing Algorithm" by Michael E. Wolf and Monica S. Lam

Posted on 21-Dec-2015

Page 1: A Data Locality Optimizing Algorithm based on A Data Locality Optimizing Algorithm by Michael E. Wolf and Monica S. Lam

A Data Locality Optimizing Algorithm

based on "A Data Locality Optimizing Algorithm"

by Michael E. Wolf and Monica S. Lam

Page 2

Outline

• Introduction
• The Problem
• Loop Transformation Theory
  – Iteration Space
  – Matrix Form of Loop Transformations
• The Localized Vector Space
  – Tiling
  – Reuse and Locality
• The SRP Algorithm

Page 3

Introduction

As processor speed continues to increase faster than memory speed, optimizations to use the memory hierarchy efficiently become ever more important.

Tiling is a well known technique that improves the data locality of numerical algorithms.

Page 4

Let’s consider the example of matrix multiplication:

for I1 := 1 to n
  for I2 := 1 to n
    for I3 := 1 to n
      C[I1,I3] += A[I1,I2] * B[I2,I3]

for II2 := 1 to n by s
  for II3 := 1 to n by s
    for I1 := 1 to n
      for I2 := II2 to min(II2 + s - 1, n)
        for I3 := II3 to min(II3 + s - 1, n)
          C[I1,I3] += A[I1,I2] * B[I2,I3]

Introduction (cont.)

[Figure: in the tiled nest, the s × s block of B used by the inner loops stays in the cache and can be reused across iterations of I1.]
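The two loop nests compute the same result; a minimal Python sketch (function names are mine) compares the original and tiled matrix multiplications on a small input:

```python
def matmul(A, B, n):
    # original loop nest: I1, I2, I3
    C = [[0] * n for _ in range(n)]
    for i1 in range(n):
        for i2 in range(n):
            for i3 in range(n):
                C[i1][i3] += A[i1][i2] * B[i2][i3]
    return C

def matmul_tiled(A, B, n, s):
    # tiled loop nest: II2 and II3 step over tiles of size s
    C = [[0] * n for _ in range(n)]
    for ii2 in range(0, n, s):
        for ii3 in range(0, n, s):
            for i1 in range(n):
                for i2 in range(ii2, min(ii2 + s, n)):
                    for i3 in range(ii3, min(ii3 + s, n)):
                        C[i1][i3] += A[i1][i2] * B[i2][i3]
    return C

n, s = 5, 2
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % n for j in range(n)] for i in range(n)]
assert matmul(A, B, n) == matmul_tiled(A, B, n, s)
```

Tiling only regroups the iterations; for each C[I1,I3] the contributions still accumulate in the same order over I2, so the results match exactly.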

Page 5

The Problem

Matrix multiplication is a particularly simple example because it is both legal and advantageous to tile the entire nest.

In general, it is not always possible to tile the entire loop nest.

Let’s consider the example of an abstraction of hyperbolic PDE code:

for I1 := 0 to 5
  for I2 := 0 to 6
    A[I2 + 1] := 1/3 * (A[I2] + A[I2 + 1] + A[I2 + 2])

Page 6

Loop Transformation Theory

In some cases where direct tiling is not applicable, we must first apply loop transformations such as interchange, skewing, and reversal. For this we need some theory.

Page 7

Iteration Space

In our model, a loop nest of depth n corresponds to a finite convex polyhedron in the iteration space Z^n, bounded by the loop bounds.

Each iteration in the loop corresponds to a node in the polyhedron, and is identified by its index vector

p = (p1, p2, …, pn),

where pi is the loop index of the i-th loop in the nest, counting from outermost to innermost.

Page 8

A dependence vector in an n-nested loop is denoted by a vector

d = (d1, d2, …, dn).

Each component di is a possibly infinite range of integers, represented by [di^min, di^max], where di^min ∈ Z ∪ {−∞}, di^max ∈ Z ∪ {+∞}, and di^min ≤ di^max.

A single dependence vector therefore represents a set of distance vectors, called its distance vector set:

E(d) = { (e1, …, en) | ei ∈ Z and di^min ≤ ei ≤ di^max }.

Iteration Space (cont.)

Page 9

for I1 := 0 to 5
  for I2 := 0 to 6
    A[I2 + 1] := 1/3 * (A[I2] + A[I2 + 1] + A[I2 + 2])

Iteration Space (cont.)

[Figure: iteration space of the loop nest, with axes I1 and I2.]

So, the dependences for our last example are D = {(0, 1), (1, 0), (1, -1)}.
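These flow dependences can be recovered by brute force: simulate the loop nest, remember the last iteration that wrote each array element, and record the iteration-space difference at each read. A small Python sketch of that idea (the function name is mine):

```python
def flow_dependences():
    # last_write[loc] = iteration (I1, I2) that last wrote A[loc]
    last_write = {}
    deps = set()
    for i1 in range(6):
        for i2 in range(7):
            # the reads A[I2], A[I2+1], A[I2+2] happen before the write
            for loc in (i2, i2 + 1, i2 + 2):
                if loc in last_write:
                    w1, w2 = last_write[loc]
                    deps.add((i1 - w1, i2 - w2))
            last_write[i2 + 1] = (i1, i2)  # write to A[I2+1]
    return deps

print(flow_dependences())  # {(0, 1), (1, 0), (1, -1)}
```

Each value written at A[I2+1] is next read one iteration later in I2, in the same statement one I1 later, or two positions earlier in the next I1 row, which yields exactly the three vectors above.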

Page 10

Matrix Form of Loop Transformations

With dependencies represented as vectors in the iteration space, loop transformations such as interchange, skewing and reversal, can be represented as matrix transformations.

For example, the matrix form of the loop interchange transformation that maps iteration (p1, p2) to iteration (p2, p1) is

[ 0 1 ] [ p1 ]   [ p2 ]
[ 1 0 ] [ p2 ] = [ p1 ]

Page 11

Matrix Form of Loop Transformations (cont.)

Since a matrix transformation T is a linear transformation on the iteration space, T(p2 − p1) = T·p2 − T·p1.

Therefore, if d = p2 − p1 is a distance vector in the original iteration space, then d′ = T·d is a distance vector in the transformed iteration space. For the loop interchange above:

[ 0 1 ] [ d1 ]   [ d2 ]
[ 1 0 ] [ d2 ] = [ d1 ]
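This matrix-vector view is easy to check concretely; a few lines of Python (variable names are mine) apply the interchange matrix to a sample distance vector:

```python
# interchange matrix T = [[0, 1], [1, 0]] applied to a distance vector d
T = ((0, 1), (1, 0))
d = (3, 5)
Td = (T[0][0] * d[0] + T[0][1] * d[1],
      T[1][0] * d[0] + T[1][1] * d[1])
assert Td == (5, 3)  # components swapped, as expected for interchange
```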

Page 12

Interchange (Permutation)

A permutation σ on a loop nest transforms iteration (p1, …, pn) to (pσ(1), …, pσ(n)). This transformation can be expressed in matrix form as Iσ, the n × n identity matrix I with rows permuted by σ.

The loop interchange above is an n = 2 example of the general permutation transformation.

Page 13

Reversal

Reversal of the i-th loop is represented by the identity matrix, but with the i-th diagonal element equal to –1 rather than 1.

For example, loop reversal of the outermost loop of a two-deep loop nest is represented as

[ −1 0 ] [ d1 ]   [ −d1 ]
[  0 1 ] [ d2 ] = [  d2 ]

Page 14

Skewing

Skewing loop Ij by an integer factor f with respect to loop Ii maps iteration

(p1, …, pi, …, pj−1, pj, pj+1, …, pn)

to

(p1, …, pi, …, pj−1, pj + f·pi, pj+1, …, pn).

For example, skewing the innermost loop of a two-deep loop nest by a factor of 1 is represented as

[ 1 0 ] [ d1 ]   [ d1      ]
[ 1 1 ] [ d2 ] = [ d1 + d2 ]
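Applying this skew matrix [[1, 0], [1, 1]] to the dependence set D = {(0, 1), (1, 0), (1, -1)} of the earlier stencil example makes every component non-negative; a small Python sketch (helper name is mine):

```python
def apply(T, d):
    # multiply a 2x2 matrix T by a column distance vector d
    return (T[0][0] * d[0] + T[0][1] * d[1],
            T[1][0] * d[0] + T[1][1] * d[1])

D = {(0, 1), (1, 0), (1, -1)}   # dependences of the stencil example
skew = ((1, 0), (1, 1))         # skew inner loop by factor 1

skewed = {apply(skew, d) for d in D}
assert skewed == {(0, 1), (1, 1), (1, 0)}  # all components now non-negative
```

Making all components non-negative is exactly what lets the nest be tiled later.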

Page 15

The Localized Vector Space

It is important to distinguish between reuse and locality. We say that a data item is reused if the same data is used in multiple iterations in a loop nest.

Thus reuse is a measure that is inherent in the computation and not dependent on the particular way the loops are written.

This reuse may not lead to saving a memory access if intervening iterations flush the data out of the cache between uses of data.

Page 16

The Localized Vector Space (cont.)

Let’s consider the following example:

for I1 := 0 to n
  for I2 := 0 to n
    f(A[I1], A[I2])

Here, reference A[I2] touches different data within the innermost loop, but reuses the same elements across the outer loop.

Page 17

Tiling

In general, tiling transforms an n-deep loop nest into a 2n-deep loop nest where the inner n loops execute a compiler-determined number of iterations.

For example, the result of applying tiling to our example of an abstraction of hyperbolic PDE code will look as follows:

for II1 := 0 to 5 by 2
  for II2 := 0 to 11 by 2
    for I1 := II1 to min(II1 + 1, 5)
      for I2 := max(I1, II2) to min(6 + I1, II2 + 1)
        A[I2 - I1 + 1] := 1/3 * (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2])
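Because the tiles execute in an order that respects every dependence, the tiled nest computes exactly the same values as the original loop. A Python sketch (function names are mine) checks this on a small array:

```python
def stencil():
    # original nest: for I1 := 0 to 5, for I2 := 0 to 6
    A = [float(i) for i in range(9)]
    for i1 in range(6):
        for i2 in range(7):
            A[i2 + 1] = (A[i2] + A[i2 + 1] + A[i2 + 2]) / 3
    return A

def stencil_tiled():
    # skewed and tiled nest with 2 x 2 tiles, as in the slide above
    A = [float(i) for i in range(9)]
    for ii1 in range(0, 6, 2):
        for ii2 in range(0, 12, 2):
            for i1 in range(ii1, min(ii1 + 1, 5) + 1):
                for i2 in range(max(i1, ii2), min(6 + i1, ii2 + 1) + 1):
                    A[i2 - i1 + 1] = (A[i2 - i1] + A[i2 - i1 + 1] + A[i2 - i1 + 2]) / 3
    return A

assert stencil() == stencil_tiled()
```

Every update still sees exactly the same input values, so the outputs are identical element for element.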

Page 18

Tiling (cont.)

As the outer loops of the tiled code control the execution of the tiles, we will refer to them as the controlling loops.

When we say tiling, we refer to the partitioning of the iteration space into rectangular blocks. Non-rectangular blocks are obtained by first applying unimodular transformations to the iteration space and then applying tiling.

for II1 := 0 to 5 by 2
  for II2 := 0 to 11 by 2
    for I1 := II1 to min(II1 + 1, 5)
      for I2 := max(I1, II2) to min(6 + I1, II2 + 1)
        A[I2 - I1 + 1] := 1/3 * (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2])

Page 19

Tiling (cont.)

[Figure: tiled iteration space, with controlling loops II1 and II2.]

for II1 := 0 to 5 by 2
  for II2 := 0 to 11 by 2
    for I1 := II1 to min(II1 + 1, 5)
      for I2 := max(I1, II2) to min(6 + I1, II2 + 1)
        A[I2 - I1 + 1] := 1/3 * (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2])

Page 20

Reuse and Locality

Since unimodular transformations and tiling can modify the localized vector space, knowing where there is reuse in the iteration space can help guide the search for the transformation that delivers the best locality.

Also, to choose between alternate transformations that exploit different reuses in a loop nest, we need a metric to quantify locality for a specific localized iteration space.

Page 21

Types of Reuse

Reuse occurs when a reference within a loop accesses the same data location in different iterations. We call this self-temporal reuse.

Likewise, if a reference accesses data on the same cache line in different iterations, it is said to possess self-spatial reuse.

Furthermore, different references may access the same locations. We say that there is group-temporal reuse if the references refer to the same location, and group-spatial reuse if they refer to the same cache line.

Page 22

The SRP Algorithm

Putting it all together, we get the algorithm known as SRP, because the unimodular transformations it performs can be expressed as the product of a skew transformation (S), a reversal transformation (R), and a permutation transformation (P).

Page 23

The SRP Algorithm

Let us illustrate the SRP algorithm using our example of an abstraction of hyperbolic PDE code:

for I1 := 0 to 5
  for I2 := 0 to 6
    A[I2 + 1] := 1/3 * (A[I2] + A[I2 + 1] + A[I2 + 2])

First an outer loop must be chosen. I1 can be the outer loop, because its dependence components are all non-negative. Loop I2 has a dependence component that is negative, but it can be made non-negative by skewing with respect to I1.

Page 24

The SRP Algorithm (cont.)

for I1 := 0 to 5
  for I2 := I1 to 6 + I1
    A[I2 - I1 + 1] := 1/3 * (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2])

[Figure: skewed iteration space, with axes I1 and I2.]

Loop I2 is now placed in the same fully permutable nest as I1; the loop nest is tilable.
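Skewing only renames the iterations (I2 now runs over I1 .. 6 + I1 instead of 0 .. 6) without reordering them, so the computed values are unchanged. A quick Python check (function names are mine):

```python
def stencil():
    # original nest: for I1 := 0 to 5, for I2 := 0 to 6
    A = [float(i) for i in range(9)]
    for i1 in range(6):
        for i2 in range(7):
            A[i2 + 1] = (A[i2] + A[i2 + 1] + A[i2 + 2]) / 3
    return A

def stencil_skewed():
    # I2 skewed by I1; subtracting I1 in the body undoes the renaming
    A = [float(i) for i in range(9)]
    for i1 in range(6):
        for i2 in range(i1, 6 + i1 + 1):
            A[i2 - i1 + 1] = (A[i2 - i1] + A[i2 - i1 + 1] + A[i2 - i1 + 2]) / 3
    return A

assert stencil() == stencil_skewed()
```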

Page 25

The SRP Algorithm (cont.)

for II2 := 0 to 11 by 2
  for I1 := 0 to 5
    for I2 := max(I1, II2) to min(6 + I1, II2 + 1)
      A[I2 - I1 + 1] := 1/3 * (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2])

[Figure: tiled skewed iteration space, with axes I1 and I2.]

Tiling the I2 loop of the fully permutable nest (with tile size 2) yields the code above.

Page 26

The End.