![Page 1: DL: Data Layout System for Heterogeneous Computingimpact.crhc.illinois.edu/shared/papers/data2012.pdfPerformance of ASTA As the default layout, ASTA is as good as Discrete Arrays Advantages](https://reader034.vdocuments.site/reader034/viewer/2022050217/5f62b46ab2cbba7f564a8798/html5/thumbnails/1.jpg)
DL: Data Layout System for
Heterogeneous Computing
I-Jui (Ray) Sung, Geng Daniel Liu, and Wen-Mei Hwu
University of Illinois at Urbana-Champaign
Agenda
GPU Global Memory Throughput and Array-of-Structure
ASTA Layout
In-Place Conversion Between Layouts
Global Memory Bandwidth
Ideal vs. Reality
GPU Memory Bandwidth vs. Stride
SAXPY with stride:
y[i * stride] = a * x[i * stride] + y[i * stride];
Nathan Bell and Michael Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA", NVIDIA Technical Report NVR-2008-004, December 2008
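A minimal C sketch of the strided SAXPY loop, for illustration only. The function name `saxpy_strided` is hypothetical, and the sketch assumes `x` and `y` each hold at least `(n - 1) * stride + 1` elements. With `stride == 1` consecutive iterations touch consecutive addresses; larger strides spread the same number of updates over a wider range of memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y[i*stride] = a * x[i*stride] + y[i*stride] for i in [0, n). */
void saxpy_strided(size_t n, size_t stride, float a, const float *x, float *y)
{
    assert(stride > 0);
    for (size_t i = 0; i < n; ++i) {
        size_t k = i * stride;          /* only every stride-th element is touched */
        y[k] = a * x[k] + y[k];
    }
}
```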
Sources of Strided Accesses
Examples of strided accesses
Structure members of the same name in an array of structures
e.g. foo[0].bar and foo[1].bar
Elements in the same column of a row-major array
e.g. A[1][2] and A[2][2]
Unit stride can be achieved through transposition
Array-of-Structures
Structure:
struct foo {
    float a;
    float b;
    float c;
    int d;
};
Array of Structures:
struct foo {
    float a;
    float b;
    float c;
    int d;
} A[8];
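To make the strided access concrete, here is a small C sketch that measures the byte distance between the same member of two neighbouring array elements. The helper name `stride_of_a` is hypothetical. On typical platforms where `float` and `int` are 4 bytes, the distance equals `sizeof(struct foo)`, i.e. 16 bytes, so threads that each read `A[i].a` use only one word out of every four.

```c
#include <assert.h>
#include <stddef.h>

struct foo {
    float a;
    float b;
    float c;
    int d;
};

/* Byte distance between member `a` of element i and of element i + 1.
 * The caller must ensure that element i + 1 exists. */
size_t stride_of_a(const struct foo *A, size_t i)
{
    assert(A != NULL);
    return (size_t)((const char *)&A[i + 1].a - (const char *)&A[i].a);
}
```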
Array-of-Structures
Many data-parallel algorithms naturally use arrays of structures
e.g. simulating the temperature, pressure, and velocity of the flow in each cell of a regular grid
Computational Fluid Dynamics Codes
Structural Engineering Codes
Financial Engineering Codes
Array-of-Structures
Build an abstract view of related data
A common source of small strided accesses
Can we decouple the abstraction from the actual layout?
“The” actual layout?
Across components of a heterogeneous system?
GPU and CPU
Across nodes?
Shared memory machines? MPI?
Data Layout Alternatives
Array of Structures (AoS)
struct foo {
    float a;
    float b;
    float c;
    int d;
} A[8];
Structure of Arrays (SoA)
struct foo {
    float a[8];
    float b[8];
    float c[8];
    int d[8];
} A;
a[8], b[8], and c[8] may be declared as separate arrays, so the term SoA is used interchangeably with Discrete Arrays
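The difference between the two layouts can be sketched as two offset functions. This is an illustration under stated assumptions: offsets are counted in 4-byte words, every field is one word wide, and the names `aos_offset`, `soa_offset`, and `NUM_FIELDS` are hypothetical. Neighbouring elements of the same field are `NUM_FIELDS` words apart in AoS but adjacent in SoA.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_FIELDS = 4 };  /* a, b, c, d */

/* Offset (in 4-byte words) of field f of element i in the AoS layout. */
size_t aos_offset(size_t i, size_t f)
{
    assert(f < NUM_FIELDS);
    return i * NUM_FIELDS + f;
}

/* Offset of the same field and element in the SoA layout holding n elements. */
size_t soa_offset(size_t i, size_t f, size_t n)
{
    assert(f < NUM_FIELDS && i < n);
    return f * n + i;
}
```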
Array-of-Structures
Example application: 1D LBM iterative CFD solver
Data grid that logically has multiple properties per cell
GPU: Lattice-Boltzmann Kernel (updates one iteration)
A vector of threads updates the same property across nearby cells
CPU: Communication Thread (exchanges boundary cells with other nodes via MPI)
Prefers the AoS layout so that the properties of boundary cells are consecutive in memory
Intuitive Solution
Map AoS dynamically to appropriate actual layouts to
fit different layout preferences in a heterogeneous system
Layout transformation?
Intuitive Solution
This work is about the non-intuitive parts of the
seemingly intuitive solution:
What layout(s)?
How do we convert between layouts efficiently?
Efficiency as in Time and Space
When should we convert between layouts?
Use array-of-structures as a case study
Data Layout Alternatives
Array of Structures (AoS)
struct foo {
    float a;
    float b;
    float c;
    int d;
} A[8];
Array of Structure of Tiled Array (ASTA), obtained by dividing the arrays into tiles (here, tile size 4)
struct foo {
    float a[4];
    float b[4];
    float c[4];
    int d[4];
} A[2];
Structure of Arrays (SoA)
struct foo {
    float a[8];
    float b[8];
    float c[8];
    int d[8];
} A;
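A rough C sketch of where an element lands in the ASTA layout, assuming 4-byte fields, offsets counted in words, and a tile size `T` that divides the element count. The names `asta_offset` and `NUM_FIELDS` are hypothetical. Within a tile the layout behaves like SoA (same-field neighbours are adjacent); across tiles it behaves like AoS (each tile is one contiguous block).

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_FIELDS = 4 };  /* a, b, c, d */

/* Offset (in 4-byte words) of field f of element i in ASTA with tile size T. */
size_t asta_offset(size_t i, size_t f, size_t T)
{
    assert(T > 0 && f < NUM_FIELDS);
    size_t tile   = i / T;                    /* which A[tile] structure        */
    size_t within = i % T;                    /* position inside the tile array */
    return tile * (NUM_FIELDS * T) + f * T + within;
}
```

For the `A[2]` example with tile size 4, `asta_offset(5, 1, 4)` corresponds to `A[1].b[1]`.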
Performance of ASTA
As the default layout, ASTA is as good as Discrete Arrays
The advantages of ASTA appear during in-place layout conversion:
Fast layout conversion (95 GB/s) from/to AoS
AoS to/from SoA (DA)
Via ASTA: 8 GB/s; Direct: <<8 GB/s
[Figure: Kernel Speedup on NVIDIA GTX480 for LBM, BlackScholes, and SpMV (bcsstk18); series: AoS, Discrete Arrays, ASTA(64), ASTA(32), ASTA(16)]
[Figure: Kernel Speedup on ATI Radeon HD5870 for LBM, BlackScholes, and SpMV (bcsstk18); series: AoS, Discrete Arrays, ASTA(64), ASTA(32), ASTA(16)]
Layout Conversion and Transposition
Converting AoS to SoA is not too different from
transposing a tall and thin array
[Figure: AoS viewed as a tall, thin array; transposing it gives SoA]
In-place Transposition: First Attempt
// data[W][H] --> data[H][W]
parallel for (j < W)
  parallel for (i < H)
    float temp = data[j][i];  // offset = j*H + i
    barrier();                // every thread finishes reading before any thread writes
    data[i][j] = temp;        // offset = i*W + j
Advantages:
Simple, fast
Disadvantages:
Scope of barrier() is one work-group
Limited by the on-chip memory accessible to one work-group
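The barrier scheme can be emulated sequentially in C as two phases. This is only a sketch: the function name `transpose_two_phase` is hypothetical, and the heap-allocated `temp` array stands in for the per-thread private temporaries of a real kernel. It shows why the approach needs temporary storage for every element handled between the read and the write, which is what ties it to on-chip memory capacity.

```c
#include <assert.h>
#include <stdlib.h>

/* Transpose data[W][H] into data[H][W] in place by emulating the barrier scheme.
 * Phase 1 stands in for every thread loading its element into a private temp;
 * phase 2 (after the barrier) stands in for every thread storing it back. */
int transpose_two_phase(float *data, size_t W, size_t H)
{
    assert(data != NULL);
    float *temp = malloc(W * H * sizeof *temp);   /* one temp per "thread" */
    if (temp == NULL)
        return -1;
    for (size_t j = 0; j < W; ++j)
        for (size_t i = 0; i < H; ++i)
            temp[j * H + i] = data[j * H + i];    /* read at offset j*H + i */
    /* barrier(): all reads are complete before any write begins */
    for (size_t j = 0; j < W; ++j)
        for (size_t i = 0; i < H; ++i)
            data[i * W + j] = temp[j * H + i];    /* write at offset i*W + j */
    free(temp);
    return 0;
}
```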
Layout Conversion and Transposition
Converting AoS to ASTA is not too different from
transposing a bunch of small tiles
The first attempt, barrier-sync, is more likely to work
[Figure: AoS is divided into tiles and each tile is transposed to obtain ASTA]
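A sequential C sketch of AoS to ASTA conversion as a series of small tile transpositions. It assumes all fields are stored as `float` (the slides' `int d` is ignored for simplicity) and that the tile size divides the element count. The names `aos_to_asta` and `SCRATCH_WORDS` are hypothetical, with the fixed-size scratch buffer standing in for on-chip memory.

```c
#include <assert.h>
#include <stddef.h>

enum { SCRATCH_WORDS = 1024 };  /* hypothetical stand-in for on-chip memory capacity */

/* Convert n elements of F fields each from AoS to ASTA with tile size T, in place.
 * Each tile is a T x F array that is transposed to F x T through the scratch buffer. */
void aos_to_asta(float *data, size_t n, size_t F, size_t T)
{
    float scratch[SCRATCH_WORDS];
    assert(T > 0 && n % T == 0);
    assert(T * F <= SCRATCH_WORDS);       /* the tile must fit in "on-chip" memory */
    for (size_t tile = 0; tile < n / T; ++tile) {
        float *base = data + tile * T * F;
        for (size_t k = 0; k < T * F; ++k)
            scratch[k] = base[k];                      /* load the whole tile */
        for (size_t e = 0; e < T; ++e)
            for (size_t f = 0; f < F; ++f)
                base[f * T + e] = scratch[e * F + f];  /* store it transposed */
    }
}
```

Because each tile is independent and small, only one tile's worth of temporary storage is needed at a time.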
AoS to ASTA Transformation
AoS to ASTA Marshaling Kernel   Global Memory Throughput (GB/s)   Fine Print
Out-of-Place                    80                                2x space
In-Place Barrier Sync           95*                               Tile size < on-chip memory
* Current results; the results reported in Table 3 (~80 GB/s) were measured on an earlier implementation
What if the tile size > on-chip memory capacity?
Layout Conversion and Transposition
Transposition is a permutation
A permutation can be decomposed into independent cycles of shifting
Offsets before transposition (M = 2 rows, N = 5 columns):
0 1 2 3 4
5 6 7 8 9
Offsets after transposition:
0 1
2 3
4 5
6 7
8 9
The element at offset curr moves to offset next = (curr % N)*M + curr/N (integer division), e.g. 0 → 0 and 1 → 2
Cycles:
{0}
{1, 2, 4, 8, 7, 5, 1}
{3, 6, 3}
{9}
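The cycle decomposition can be tried out with a short sequential C sketch. The function names `next_offset` and `transpose_cycles` are hypothetical, and the `moved` marker array is an assumption made for simplicity; it is not claimed to be how the authors track visited cycles. `next_offset` is the mapping `(curr % N)*M + curr/N` from the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Destination offset of the element currently at offset curr when an
 * M x N row-major array is transposed to N x M. */
size_t next_offset(size_t curr, size_t M, size_t N)
{
    return (curr % N) * M + curr / N;
}

/* In-place transposition by following each cycle once. `moved` marks offsets
 * that already hold their final value. Returns 0 on success, -1 on failure. */
int transpose_cycles(float *data, size_t M, size_t N)
{
    assert(data != NULL && M > 0 && N > 0);
    bool *moved = calloc(M * N, sizeof *moved);
    if (moved == NULL)
        return -1;
    for (size_t start = 0; start < M * N; ++start) {
        if (moved[start])
            continue;                       /* this cycle was already shifted */
        size_t curr = start;
        float carried = data[curr];
        do {
            size_t next = next_offset(curr, M, N);
            float displaced = data[next];
            data[next] = carried;           /* shift the carried value one step */
            moved[next] = true;
            carried = displaced;
            curr = next;
        } while (curr != start);
    }
    free(moved);
    return 0;
}
```

With M = 2 and N = 5, starting at offset 1 visits 2, 4, 8, 7, 5 and returns to 1, matching the second cycle listed above.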
Cycle Following – Original
Cycles:
thread 0: {0}
thread 1: {1, 2, 4, 8, 7, 5, 1}
thread 2: {3, 6, 3}
thread 3: {9}
This is equivalent to a straightforward parallelization of the IPT algorithm in Gustavson et al., "In-place transposition of rectangular matrices", in PARA'06.
Load imbalance: the cycles have very different lengths
Cycle Following – Load Balanced
{1, 2, 4, 8, 7, 5, 1} {3, 6, 3} {0} {9}
0 1 2 3 4
5 6 7 8 9
t0 t1 t2 t3 t4
AoS to ASTA Transformation
AoS to ASTA Marshaling Kernel   Global Memory Throughput (GB/s)   Fine Print
Out-of-Place                    80                                2x space
In-Place Barrier Sync           95*                               Tile size < on-chip memory
In-Place Cycle Following        14*                               Any tile size
* Current results; Table 3 in the paper was measured on an earlier implementation
Layout Conversion and Transposition
Converting SoA to ASTA is not too different from
transposing a matrix with super-elements
The first attempt, barrier-sync, would still not work
[Figure: SoA viewed as a matrix of super-elements; transposing the super-elements gives ASTA]
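The super-element view can be illustrated with an out-of-place C sketch. Note that this is deliberately not the in-place algorithm the slides discuss; it only shows which block moves where. It assumes all fields are stored as `float` and that the tile size `T` divides the element count `n`; the name `soa_to_asta` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Out-of-place SoA -> ASTA conversion. The SoA buffer is viewed as an
 * F x (n/T) matrix whose entries are super-elements of T contiguous floats;
 * ASTA is the (n/T) x F transpose of that matrix. */
void soa_to_asta(float *dst, const float *src, size_t n, size_t F, size_t T)
{
    assert(dst != NULL && src != NULL);
    assert(T > 0 && n % T == 0);
    size_t tiles = n / T;
    for (size_t f = 0; f < F; ++f)
        for (size_t t = 0; t < tiles; ++t)
            memcpy(dst + (t * F + f) * T,       /* row t, column f of the result */
                   src + (f * tiles + t) * T,   /* row f, column t of the source */
                   T * sizeof *src);
}
```

Each `memcpy` moves one super-element as a unit, so the elements inside a tile keep their relative order.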
SoA to ASTA Transformation
SoA to ASTA Marshaling Kernel   Global Memory Throughput (GB/s)   Fine Print
In-Place Barrier Sync           --                                Does not work
In-Place Cycle Following        9                                 ASTA(64): 17 GB/s; ASTA(32): 9 GB/s; ASTA(16): 4 GB/s
SoA to ASTA Transformation
[Figure: Original vs. Load Balanced Cycle Following Algorithms; x-axis: Sparse Matrices, Tile Size; y-axis: Sustained Memory Bandwidth (GB/s), 0 to 30; series: Original, Load Balanced]
Summary
A new layout for AoS and tall arrays is proposed
Good locality on GPUs
Enables efficient in-place marshaling
Parallel in-place tiled transposition algorithms for
AoS/SoA ↔ ASTA are proposed
The tool itself is available upon request
A library implementation for ATI and NVIDIA in OpenCL and
CUDA is available at
https://bitbucket.org/ijsung/libmarshal
Questions?
Backup Slides
Marshaling Overhead
Runtime in-place marshaling at transformation boundaries
GPU kernel invocation and CPU/GPU memory transfer
Parallel high-throughput in-place transposition kernels
ASTA layout
Array-of-Structures and Discrete Arrays
In the Array-of-Structure microbenchmark
t0 t1 t2 t3 t4 t5 t6 t7
A
Array-of-Structures and Discrete Arrays
In the Discrete Arrays microbenchmark
A
t0 t1 t2 t3 t4 t5 t6 t7
Array-of-Structures and Discrete Arrays
ATI (Evergreen) NVIDIA (Fermi)
GPU caches are too small to hold structure instances for every executing wavefront
Future CPUs will have less cache per thread because of energy limitations
DRAM Bank Organization
• Each core array has about 1M bits
• Each bit is stored in a tiny capacitor, made of one transistor
DRAM Bursting
A very small (8x2 bit) DRAM bank
DL for OpenCL
Sources of Strided Accesses
When the stride is large (>10^3 bytes),
problems are mostly due to conflicting cache lines and DRAM banks
When the stride is small (<10^3 bytes),
the problem can sometimes be alleviated by having large cache lines
Cycle Following - Improvement
{1, 2, 4, 8, 7, 5, 1} {3, 6, 3}
t=1: t1 t2 t3 t4
t=2: t1 t2 t3 t4
t=3: t5 t6 t3 t4
t=4: t5 t7 t8 t4
1. R1 = Load(id). If RED, then quit.
2. Load next in cycle to R2.
3. Atomically set next in cycle to RED.
4. If this succeeds, store R1 to next in cycle, set R1 = R2, and repeat from step 2 (otherwise quit).
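The four steps can be emulated sequentially in C to see how claiming prevents an element from being moved twice. This is a sketch under loudly stated assumptions: the "threads" run one after another rather than concurrently, RED is modelled as a separate boolean flag per offset, and the plain test-then-set in step 3 stands in for the atomic operation a real kernel would need. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static size_t next_in_cycle(size_t curr, size_t M, size_t N)
{
    return (curr % N) * M + curr / N;
}

/* Emulates one thread. red[k] == true means offset k has been claimed. */
static void follow_from(float *data, bool *red, size_t id, size_t M, size_t N)
{
    float r1 = data[id];                 /* step 1: R1 = Load(id)             */
    if (red[id])
        return;                          /*         already RED: quit         */
    size_t curr = id;
    for (;;) {
        size_t next = next_in_cycle(curr, M, N);
        float r2 = data[next];           /* step 2: load next in cycle to R2  */
        if (red[next])                   /* step 3: non-atomic stand-in for   */
            return;                      /*         test-and-set; claim fails */
        red[next] = true;
        data[next] = r1;                 /* step 4: store R1 to next,         */
        r1 = r2;                         /*         R1 = R2, repeat           */
        curr = next;
    }
}

/* Runs one emulated thread per element of an M x N array. Returns 0 on success. */
int transpose_claiming(float *data, size_t M, size_t N)
{
    assert(data != NULL && M > 0 && N > 0);
    bool *red = calloc(M * N, sizeof *red);
    if (red == NULL)
        return -1;
    for (size_t id = 0; id < M * N; ++id)
        follow_from(data, red, id, M, N);
    free(red);
    return 0;
}
```

In this sequential emulation the first thread to reach a cycle walks all of it and later threads on that cycle quit at step 1; on real hardware the work is shared among whichever threads win the claims.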