DL: Data Layout System for Heterogeneous Computing I-Jui (Ray) Sung, Geng Daniel Liu, and Wen-Mei Hwu University of Illinois at Urbana-Champaign


DL: Data Layout System for Heterogeneous Computing

impact.crhc.illinois.edu/shared/papers/data2012.pdf

I-Jui (Ray) Sung, Geng Daniel Liu, and Wen-Mei Hwu

University of Illinois at Urbana-Champaign


Agenda

GPU Global Memory Throughput and Array-of-Structure

ASTA Layout

In-Place Conversion Between Layouts


Global Memory Bandwidth

[Figure: global memory bandwidth, ideal vs. reality]


GPU Memory Bandwidth vs. Stride

SAXPY with stride:

y[i * stride] = a * x[i * stride] + y[i * stride];

Nathan Bell and Michael Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, December 2008
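The strided SAXPY statement above can be sketched as a plain C loop in which each iteration stands in for one GPU thread. The function name `saxpy_strided` and its parameters are illustrative assumptions, not part of the benchmark from the cited report. With `stride == 1`, neighbouring iterations touch neighbouring words; larger strides spread the same number of useful words over more memory.

```c
#include <assert.h>
#include <stddef.h>

/* Strided SAXPY (hypothetical helper): updates n elements of y that lie
 * `stride` floats apart. x and y must each hold at least
 * (n - 1) * stride + 1 floats. */
void saxpy_strided(size_t n, size_t stride, float a, const float *x, float *y)
{
    assert(stride >= 1);
    for (size_t i = 0; i < n; ++i) {
        /* On a GPU, i would be the global thread index. */
        y[i * stride] = a * x[i * stride] + y[i * stride];
    }
}
```

Calling it with `stride = 2` on eight-element arrays updates only the even-indexed elements of `y` and leaves the odd-indexed ones untouched.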


Sources of Strided Accesses

Examples of strided accesses

Structure members of the same name in an array-of-structure

e.g. foo[0].bar and foo[1].bar

Elements in the same column of a row-major array

e.g. A[1][2] and A[2][2]

Unit stride can be achieved through transposition
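As a rough illustration of the two sources of stride listed above, the sketch below computes the byte distance between same-named members of neighbouring structures and between vertically adjacent elements of a row-major array. The member names of `struct foo` and all function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure; only the member `bar` comes from the slide. */
struct foo { float bar; float baz; float qux; int id; };

/* Byte distance between foo[i].bar and foo[i+1].bar: one whole structure. */
size_t aos_member_stride(void)
{
    return sizeof(struct foo);
}

/* Byte distance between A[r][c] and A[r+1][c] in a row-major float array
 * with `cols` columns: one whole row. */
size_t column_stride(size_t cols)
{
    assert(cols > 0);
    return cols * sizeof(float);
}

/* Offset (in elements) of logical element (r, c) once the array is stored
 * transposed; walking down a column now moves one element at a time. */
size_t transposed_offset(size_t r, size_t c, size_t rows)
{
    return c * rows + r;
}
```

After transposition, `A[1][2]` and `A[2][2]` sit at consecutive offsets, which is the unit-stride case.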


Array-of-Structures

Structure:

struct foo {
    float a;
    float b;
    float c;
    int d;
};

Array of Structures:

struct foo {
    float a;
    float b;
    float c;
    int d;
} A[8];


Array-of-Structures

Many data-parallel algorithms naturally operate on arrays of structures

e.g. simulating the temperature, pressure, and velocity of the flow in a cell of a regular grid

Computational Fluid Dynamics Codes

Structural Engineering Codes

Financial Engineering Codes


Array-of-Structures

Build an abstract view of related data

A common source of small strided accesses

Can we decouple the abstraction from the actual layout?

“The” actual layout?

Across components of a heterogeneous system?

GPU and CPU

Across nodes?

Shared memory machines? MPI?


Data Layout Alternatives

Array of Structures (AoS)

struct foo {
    float a;
    float b;
    float c;
    int d;
} A[8];

Structure of Arrays (SoA)

struct foo {
    float a[8];
    float b[8];
    float c[8];
    int d[8];
} A;

a[8], b[8], and c[8] may be declared as separate arrays, so the term SoA is used interchangeably with Discrete Arrays

Array-of-Structures

Example application: 1D LBM iterative CFD solver

Data grid that logically has multiple properties per cell

GPU: Lattice-Boltzmann kernel (updates one iteration)

A vector of threads updates the same property across nearby cells

CPU: communication thread (exchanges boundary cells with other nodes via MPI)

Prefers the AoS layout, so that the properties of boundary cells are consecutive in memory


Intuitive Solution

Map AoS dynamically to appropriate actual layouts to fit the different layout preferences in a heterogeneous system

[Figure: layout transformation to an unspecified ("?") layout]


Intuitive Solution

This work is about the non-intuitive parts of the seemingly intuitive solution:

What layout(s)?

How do we convert between layouts efficiently?

Efficiency as in Time and Space

When should we convert between layouts?

Use array-of-structures as a case study


Data Layout Alternatives

Array of Structures (AoS)

struct foo {
    float a;
    float b;
    float c;
    int d;
} A[8];

Array of Structure of Tiled Array (ASTA), obtained by dividing the arrays of the SoA layout into tiles

struct foo {
    float a[4];
    float b[4];
    float c[4];
    int d[4];
} A[2];

Structure of Arrays (SoA)

struct foo {
    float a[8];
    float b[8];
    float c[8];
    int d[8];
} A;
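One way to compare the three layouts is to write down where member `field` of logical element `i` lands in each of them. The sketch below does this in units of 32-bit words, assuming every member occupies exactly one word (true for `float` and `int` on common platforms, but an assumption nonetheless). The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_FIELDS = 4 };  /* members a, b, c, d of the slide's struct foo */

/* AoS: the fields of one element are adjacent. */
size_t aos_offset(size_t i, size_t field)
{
    assert(field < NUM_FIELDS);
    return i * NUM_FIELDS + field;
}

/* SoA: all n values of one field are adjacent. */
size_t soa_offset(size_t i, size_t field, size_t n)
{
    assert(field < NUM_FIELDS && i < n);
    return field * n + i;
}

/* ASTA: within a tile of `tile` elements the layout is SoA;
 * the tiles themselves are stored one after another. */
size_t asta_offset(size_t i, size_t field, size_t tile)
{
    assert(field < NUM_FIELDS && tile > 0);
    return (i / tile) * (NUM_FIELDS * tile) + field * tile + (i % tile);
}
```

Under these formulas ASTA with a tile size of 1 coincides with AoS, and ASTA with a tile size equal to the array length coincides with SoA, which is one way to see it as a middle ground between the two.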


Performance of ASTA

As the default layout, ASTA is as good as Discrete Arrays

The advantages of ASTA show up during in-place layout conversion:

Fast layout conversion (95 GB/s) from/to AoS

AoS to/from SoA (DA):

Via ASTA: 8 GB/s; Direct: << 8 GB/s

[Figures: kernel speedup on NVIDIA GTX480 (axis 0 to 12) and on ATI Radeon HD5870 (axis 0 to 4.5) for LBM, BlackScholes, and SpMV (bcsstk18); series: AoS, Discrete Arrays, ASTA(64), ASTA(32), ASTA(16)]


Layout Conversion and Transposition

Converting AoS to SoA is not too different from transposing a tall and thin array

[Figure: AoS is the same as a tall, thin array; transposing it gives the same layout as SoA]
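The correspondence can be sketched with an ordinary out-of-place transpose. The code below treats an AoS of `n` structures with `k` members as an `n` x `k` row-major matrix and assumes, for simplicity, that every member is a `float`. The function names are illustrative, and this version needs a second buffer, unlike the in-place kernels discussed in the following slides.

```c
#include <assert.h>
#include <stddef.h>

/* Out-of-place transpose of a rows x cols row-major matrix. */
void transpose(const float *src, float *dst, size_t rows, size_t cols)
{
    assert(src != dst);
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            dst[c * rows + r] = src[r * cols + c];
        }
    }
}

/* AoS (n structures, k members each) is an n x k matrix;
 * SoA is its k x n transpose. */
void aos_to_soa(const float *aos, float *soa, size_t n, size_t k)
{
    transpose(aos, soa, n, k);
}

void soa_to_aos(const float *soa, float *aos, size_t n, size_t k)
{
    transpose(soa, aos, k, n);
}
```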


In-place Transposition: First Attempt

// data[W][H] --> data[H][W]
parallel for (j < W)
  parallel for (i < H) {
    float temp = data[j][i];  // offset = j*H + i
    barrier();                // every element is read before any is overwritten
    data[i][j] = temp;        // offset = i*W + j
  }

Advantages:

Simple, Fast

Disadvantages:

Scope of barrier() is one work-group

Limited by the on-chip memory accessible to one work-group


Layout Conversion and Transposition

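Converting AoS to ASTA is not too different from transposing a bunch of small tiles

The first attempt, barrier-sync, is more likely to work here, because a small tile can fit in on-chip memory

[Figure: AoS is the same as a tall, thin array; dividing it into tiles and transposing each tile gives the same layout as ASTA]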
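A sequential sketch of this per-tile idea follows. Each tile is copied into a small staging buffer, which stands in for the on-chip memory of one work-group, and then written back transposed; the point between the two phases is where a GPU kernel would place its barrier. The buffer limit `MAX_TILE_WORDS`, the function name, and the assumption that all members are `float` are choices made for the example, not details of the authors' kernels.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_TILE_WORDS = 256 };  /* stand-in for one work-group's on-chip memory */

/* Converts n structures of k floats each from AoS to ASTA in place.
 * Returns 0 on success, -1 if n is not a multiple of `tile` or a tile
 * does not fit in the staging buffer. */
int aos_to_asta_inplace(float *data, size_t n, size_t k, size_t tile)
{
    float staging[MAX_TILE_WORDS];

    assert(data != NULL);
    if (tile == 0 || k == 0 || n % tile != 0 || tile * k > MAX_TILE_WORDS) {
        return -1;
    }
    for (size_t t = 0; t < n / tile; ++t) {
        float *base = data + t * tile * k;

        /* Read phase: every element of the tile is loaded. */
        memcpy(staging, base, tile * k * sizeof *base);

        /* A GPU kernel would call barrier() here. */

        /* Write phase: the tile x k block becomes k x tile. */
        for (size_t i = 0; i < tile; ++i) {
            for (size_t f = 0; f < k; ++f) {
                base[f * tile + i] = staging[i * k + f];
            }
        }
    }
    return 0;
}
```

The `-1` return for an oversized tile mirrors the limitation stated on the slide: the approach only applies while a tile fits in the memory shared by one work-group.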


AoS to ASTA Transformation

AoS to ASTA Marshaling Kernel | Global Memory Throughput (GB/s) | Fine Print
Out-of-Place                  | 80                              | 2x space
In-Place Barrier Sync         | 95*                             | Tile size < on-chip memory

* Current results; the results reported in Table 3 (~80 GB/s) were measured on an earlier implementation

What if the tile size > on-chip memory capacity?


Layout Conversion and Transposition

Transposition is a permutation

A permutation can be decomposed into independent cycles of shifting

Offsets before the transpose (M x N = 2 x 5):

0 1 2 3 4
5 6 7 8 9

Offsets after the transpose (5 x 2):

0 1
2 3
4 5
6 7
8 9

curr -> next: next = (curr % N)*M + curr/N, with M = 2, N = 5

Cycles:

{0}
{1, 2, 4, 8, 7, 5, 1}
{3, 6, 3}
{9}
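The cycles listed on the slide can be reproduced with a few lines of C. The sketch below applies the `curr -> next` formula repeatedly until it returns to the starting offset; the function names `next_index` and `list_cycle` are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Destination offset of the element currently at offset `curr` when an
 * M x N row-major matrix is transposed in place. */
size_t next_index(size_t curr, size_t M, size_t N)
{
    assert(curr < M * N);
    return (curr % N) * M + curr / N;
}

/* Writes the offsets of the cycle containing `start` into `out`
 * (beginning with `start`, without repeating it at the end) and returns
 * the cycle length. `out` must have room for M * N entries. */
size_t list_cycle(size_t start, size_t M, size_t N, size_t *out)
{
    size_t len = 0;
    size_t p = start;
    do {
        out[len++] = p;
        p = next_index(p, M, N);
    } while (p != start);
    return len;
}
```

For M = 2 and N = 5, starting from offset 1 yields 1, 2, 4, 8, 7, 5, and starting from offset 3 yields 3, 6; offsets 0 and 9 map to themselves.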


Cycle Following – Original

Cycles:

thread 0: {0}

thread 1: {1, 2, 4, 8, 7, 5, 1}

thread 2: {3, 6, 3}

thread 3: {9}

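This is equivalent to a straightforward parallelization of the IPT algorithm in Gustavson et al., "In-place transposition of rectangular matrices," in PARA'06.

Problem: load imbalance (thread 1 shifts a six-element cycle, while threads 0 and 3 each hold a single fixed element)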
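A sequential sketch of one-cycle-per-worker cycle following is shown below. It is not the algorithm from the cited paper, only a minimal illustration: each cycle is shifted by the iteration whose starting offset is the smallest offset in that cycle, which plays the role of the thread assigned to the cycle. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

static size_t next_index(size_t curr, size_t M, size_t N)
{
    return (curr % N) * M + curr / N;
}

/* In-place transpose of an M x N row-major matrix; the result is the
 * N x M row-major transpose. No extra storage beyond two scalars. */
void transpose_inplace_cycles(float *data, size_t M, size_t N)
{
    assert(data != NULL);
    size_t total = M * N;

    for (size_t start = 0; start < total; ++start) {
        /* Only the smallest offset of a cycle (its leader) shifts it. */
        size_t p = next_index(start, M, N);
        while (p > start) {
            p = next_index(p, M, N);
        }
        if (p < start) {
            continue;  /* this cycle belongs to an earlier leader */
        }

        /* Shift every element of the cycle to its destination. */
        float carried = data[start];
        p = start;
        do {
            size_t dest = next_index(p, M, N);
            float displaced = data[dest];
            data[dest] = carried;
            carried = displaced;
            p = dest;
        } while (p != start);
    }
}
```

The amount of work per leader equals the cycle length, so a parallel version that gives one cycle to each thread inherits the imbalance noted on the slide.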


Cycle Following – Load Balanced

{1, 2, 4, 8, 7, 5, 1} {3, 6, 3} {0} {9}

0 1 2 3 4

5 6 7 8 9

t0 t1 t2 t3 t4


AoS to ASTA Transformation

AoS to ASTA Marshaling Kernel | Global Memory Throughput (GB/s) | Fine Print
Out-of-Place                  | 80                              | 2x space
In-Place Barrier Sync         | 95*                             | Tile size < on-chip memory
In-Place Cycle Following      | 14*                             | Any tile size

* Current results; Table 3 in the paper was measured on an earlier implementation


Layout Conversion and Transposition

Converting SoA to ASTA is not too different from transposing a matrix with super-elements

The first attempt, barrier-sync, would still not work

[Figure: SoA is the same as a matrix whose elements are tiles (super-elements); transposing the super-elements gives the same layout as ASTA]
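The super-element view can be sketched as follows. An SoA with `k` fields of `n` values each is treated as a `k` x `(n / tile)` matrix whose entries are blocks of `tile` consecutive floats, and that matrix is transposed in place by cycle following, moving one block at a time. This is a sequential illustration under the assumption that all members are `float`; `MAX_TILE` and the function name are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_TILE = 64 };  /* largest super-element this sketch supports */

/* Converts k arrays of n floats (SoA) to ASTA with the given tile size,
 * in place. Returns 0 on success, -1 on an unsupported tile size. */
int soa_to_asta_inplace(float *data, size_t n, size_t k, size_t tile)
{
    float carried[MAX_TILE], displaced[MAX_TILE];

    assert(data != NULL);
    if (tile == 0 || tile > MAX_TILE || n % tile != 0) {
        return -1;
    }
    size_t M = k, N = n / tile;          /* matrix of super-elements */
    size_t bytes = tile * sizeof *data;

    for (size_t start = 0; start < M * N; ++start) {
        /* Process each cycle once, from its smallest index. */
        size_t p = (start % N) * M + start / N;
        while (p > start) {
            p = (p % N) * M + p / N;
        }
        if (p < start) {
            continue;
        }
        memcpy(carried, data + start * tile, bytes);
        p = start;
        do {
            size_t dest = (p % N) * M + p / N;
            memcpy(displaced, data + dest * tile, bytes);
            memcpy(data + dest * tile, carried, bytes);
            memcpy(carried, displaced, bytes);
            p = dest;
        } while (p != start);
    }
    return 0;
}
```

Unlike the AoS to ASTA case, the blocks that trade places here can be far apart in memory, which is consistent with the slide's remark that a single work-group barrier is not enough.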


SoA to ASTA Transformation

SoA to ASTA Marshaling Kernel | Global Memory Throughput (GB/s) | Fine Print
In-Place Barrier Sync         | --                              | Does not work
In-Place Cycle Following      | 9                               | ASTA(64): 17 GB/s; ASTA(32): 9 GB/s; ASTA(16): 4 GB/s


SoA to ASTA Transformation

[Figure: Original vs. Load Balanced Cycle Following Algorithms. Sustained memory bandwidth (GB/s, axis 0 to 30) for each sparse matrix and tile size; series: Original, Load Balanced]


Summary

A new layout for AoS and tall arrays is proposed

Good locality on GPUs

Enables efficient in-place marshaling

Parallel in-place tiled transposition algorithms for AoS/SoA ↔ ASTA are proposed

The tool itself is available upon request

A library implementation for ATI and NVIDIA in OpenCL and CUDA is available at https://bitbucket.org/ijsung/libmarshal


Questions?


Backup Slides


Marshaling Overhead

Runtime In-place marshaling at transformation boundaries

GPU kernel invocation and CPU/GPU memory transfer

Parallel high-throughput in-place transposition kernels

ASTA layout


Array-of-Structures and Discrete Arrays

In the Array-of-Structure microbenchmark

[Figure: access pattern of threads t0 to t7 over array A]

In the Discrete Arrays microbenchmark

[Figure: access pattern of threads t0 to t7 over array A]

Array-of-Structures and Discrete Arrays

[Figures: ATI (Evergreen) and NVIDIA (Fermi)]

GPU caches are too small to hold structure instances for every executing wave-front

Future CPUs will have less cache per thread because of energy limitations


DRAM Bank Organization

• Each core array has about 1M bits

• Each bit is stored in a tiny capacitor, made of one transistor


DRAM Bursting


A very small (8x2 bit) DRAM bank


DL for OpenCL


Sources of Strided Accesses

When the stride is large (> 10^3 bytes),

the problems are mostly conflicting cache lines and DRAM banks

When the stride is small (< 10^3 bytes),

the problem can sometimes be alleviated by having large cache lines


Cycle Following - Improvement

{1, 2, 4, 8, 7, 5, 1} {3, 6, 3}

[Figure: threads working on the cycle elements at each time step. t=1: t1 t2 t3 t4; t=2: t1 t2 t3 t4; t=3: t5 t6 t3 t4; t=4: t5 t7 t8 t4]

1. R1 = Load(id). If RED, then quit.

2. Load next in cycle to R2.

3. Atomically set next in cycle to RED.

4. If that succeeds, store R1 to next in cycle, set R1 = R2, and repeat 2-3.
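The four steps can be sketched sequentially in C. This is a simulation under stated assumptions rather than the GPU kernel: the RED marks live in a separate `bool` array instead of in the data itself, the "atomic" set is an ordinary test followed by a store, and the simulated threads run one after another, so the first thread to enter a cycle shifts all of it. On a GPU the same steps would run concurrently and threads entering the same cycle would split the work. The function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Steps 1-4 for one simulated thread `id`. red[p] is set once the final
 * value of position p has been claimed by some thread. */
static void follow_from(float *data, bool *red, size_t id, size_t M, size_t N)
{
    float r1 = data[id];                 /* step 1: load own element */
    if (red[id]) {
        return;                          /* already handled by another thread */
    }
    size_t pos = id;
    for (;;) {
        size_t next = (pos % N) * M + pos / N;
        float r2 = data[next];           /* step 2: load next in cycle */
        if (red[next]) {
            return;                      /* step 3 failed: next already claimed */
        }
        red[next] = true;                /* step 3: claim (atomic on a GPU) */
        data[next] = r1;                 /* step 4: deliver the carried value */
        r1 = r2;
        pos = next;
    }
}

/* In-place transpose of an M x N row-major matrix using marks.
 * Returns 0 on success, -1 if the mark array cannot be allocated. */
int transpose_inplace_marked(float *data, size_t M, size_t N)
{
    assert(data != NULL);
    size_t total = M * N;
    bool *red = calloc(total ? total : 1, sizeof *red);
    if (red == NULL) {
        return -1;
    }
    for (size_t id = 0; id < total; ++id) {
        follow_from(data, red, id, M, N);
    }
    free(red);
    return 0;
}
```

The design trades one mark per element for the ability to start a worker at every offset, so no thread has to know in advance which cycle it belongs to or how long that cycle is.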