Register blocking performance
yelick/papers/ppsc99.pdf
Performance of blocked code (chosen size)...
TRANSCRIPT
[Figure: Register blocking performance. x-axis: columns in register block (1 to 12); y-axis: Mflops/sec. (10 to 55); legend: 1x, 2x, 3x, 4x, 5x, 6x, 7x, 8x, 9x, 10x, 11x, 12x.]
[Figure: Performance of register blocked code on Ultrasparc. x-axis: matrices (1 to 10); y-axis: Mflops/sec. (0 to 50); legend: Base performance, Performance of blocked code (chosen size), Performance of blocked code (best). Bar annotations, each a block size with a ratio in parentheses, in extraction order: 2x2(1.23), 2x2(1.35), 2x1(1.00), 6x6(1.19), 3x3(1.06), 3x3(1.00), 3x3(1.00), 3x3(1.00), 3x3(1.00), 2x2(1.21), 2x2(1.21), 6x6(1.15), 6x2(1.13), 2x2(1.33), 2x1(1.17), 4x4(1.00), 4x2(1.00), 2x2(1.98), 2x1(1.71), 2x2(1.23).]
Matrix  Size (N)  Nonzeros  Application area
1       23560     484K      Airfoil eigenvalue calculation
2       41092     1.7M      2D PDE problem
3       30237     1.5M      Automobile frame stiffness matrix
4       13965     1.0M      FEM stiffness matrix
5       24696     1.8M      FEM stiffness matrix
6       52329     2.7M      Engine block stiffness matrix
7       54870     2.7M      Structure from shuttle rocket booster
8       4134      94K       Chemical process separation
9       62424     1.7M      Unstructured Euler solver
10      26068     177K      Device simulation
[Diagram: labels R, C, cache, cache, y, x.]
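[Diagram: sparse matrix A drawn as a grid of nonzero (A) and zero (0) entries. Entry labels recovered from the figure: 03, 06, 14, 17, 21, 22, 25, 26, 30, 34, 42, 43, 46, 47, 51, 54, 61, 62, 65, 67, 72, 73, 74, 75. Storage arrays shown: value, col_idx, row_start, block_ptr. The array contents did not survive extraction intact; the recoverable fragments are "03 21 22 30", "00 3 1 2 6", "06", "4", "14", "0 52 2 4 6 11 19 24 ... ...", "8 16".]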
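The storage arrays named in the diagram (value, col_idx, row_start) suggest a compressed sparse row layout and a register-blocked variant of it. The following is a minimal sketch of both, not the layout used in the slides, which cannot be recovered from this transcript. It assumes that each r x c block is stored row-major with explicit zeros filled in, that col_idx holds the first column of each block, and that row_start indexes blocks per block row. The function names csr_spmv, dense_to_bcsr and bcsr_spmv are hypothetical.

```python
def csr_spmv(row_start, col_idx, value, x):
    """y = A*x for A in compressed sparse row form (one entry per nonzero)."""
    n = len(row_start) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_start[i], row_start[i + 1]):
            acc += value[k] * x[col_idx[k]]
        y[i] = acc
    return y


def dense_to_bcsr(a, r, c):
    """Build an r x c blocked CSR structure from a dense list of lists.

    Assumption: dimensions are multiples of r and c. Any block holding at
    least one nonzero is stored in full, so zeros inside it become explicit.
    """
    n_rows, n_cols = len(a), len(a[0])
    assert n_rows % r == 0 and n_cols % c == 0
    row_start, col_idx, value = [0], [], []
    for bi in range(n_rows // r):
        for bj in range(n_cols // c):
            block = [a[bi * r + i][bj * c + j]
                     for i in range(r) for j in range(c)]
            if any(v != 0 for v in block):
                col_idx.append(bj * c)      # first column of the block
                value.extend(block)         # r*c values, row-major
        row_start.append(len(col_idx))      # counts blocks, not values
    return row_start, col_idx, value


def bcsr_spmv(row_start, col_idx, value, x, r, c):
    """y = A*x for A in r x c blocked CSR form."""
    n_block_rows = len(row_start) - 1
    y = [0.0] * (n_block_rows * r)
    for bi in range(n_block_rows):
        acc = [0.0] * r                     # stands in for r registers
        for b in range(row_start[bi], row_start[bi + 1]):
            base = b * r * c
            j0 = col_idx[b]
            for i in range(r):
                for j in range(c):
                    acc[i] += value[base + i * c + j] * x[j0 + j]
        for i in range(r):
            y[bi * r + i] = acc[i]
    return y


if __name__ == "__main__":
    a = [[1, 2, 0, 0],
         [3, 0, 0, 0],
         [0, 0, 0, 4],
         [0, 0, 5, 6]]
    x = [1.0, 2.0, 3.0, 4.0]
    rs, ci, val = dense_to_bcsr(a, 2, 2)
    print(bcsr_spmv(rs, ci, val, x, 2, 2))
```

In compiled code the two inner loops would be fully unrolled for a fixed r and c, so the accumulators and the reused entries of x can stay in registers. The price is the explicit zeros stored inside partially filled blocks: in the example above, 8 values are stored for 6 nonzeros.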
[Figure: Performance of Static Cache blocking on Random Matrices. x-axis: Density of Random Matrices (0.05 to 0.25); y-axis: Performance (0 to 20); legend: 128x128 block, 256x256 block, 512x512 block, 1024x1024 block, 2048x2048 block, 4096x4096 block, 8192x8192 block, 16384x16384 block, 32768x32768 block, 65536x65536 block.]
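The legend above varies a square cache block size. As a rough illustration of what static cache blocking means for sparse matrix-vector multiplication, the sketch below groups nonzeros into fixed-size tiles and multiplies tile by tile, so that each tile touches only a bounded slice of x and y. It is an assumption-laden sketch, not the implementation measured in the figure: the names cache_block and cache_blocked_spmv are hypothetical, and nonzeros are kept as plain (row, column, value) triples, whereas a tuned code would store each tile in a compressed format.

```python
def cache_block(entries, block_rows, block_cols):
    """Group (i, j, v) nonzeros into tiles keyed by (block row, block column)."""
    tiles = {}
    for i, j, v in entries:
        key = (i // block_rows, j // block_cols)
        tiles.setdefault(key, []).append((i, j, v))
    return tiles


def cache_blocked_spmv(tiles, x, n_rows):
    """y = A*x, visiting one tile at a time.

    Within a tile, the accesses to x span at most block_cols entries and the
    accesses to y span at most block_rows entries.
    """
    y = [0.0] * n_rows
    for key in sorted(tiles):               # block-row-major tile order
        for i, j, v in tiles[key]:
            y[i] += v * x[j]
    return y


if __name__ == "__main__":
    entries = [(0, 0, 1.0), (0, 3, 2.0), (1, 1, 3.0),
               (2, 0, 4.0), (3, 2, 5.0), (3, 3, 6.0)]
    x = [1.0, 2.0, 3.0, 4.0]
    tiles = cache_block(entries, 2, 2)
    print(cache_blocked_spmv(tiles, x, 4))
```

The tile dimensions play the role of the block sizes in the legend: smaller tiles bound the working set more tightly but add per-tile overhead.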
[Diagram: labels Proc 3, Proc 2, Proc 1, Proc 0; C1, C2, C4; S1, S2, S3, S4.]
[Figure: Performance of cache blocking on SMP: 1. x-axis: number of processors (1 to 8); y-axis: Mflops/proc. (0 to 35); legend: base performance, C1, C2, C3, C4.]
[Figure: Performance of cache blocking on SMP: 5. x-axis: number of processors (1 to 8); y-axis: Mflops/proc. (0 to 35); legend: base performance, C1, C2, C3, C4.]
[Figure: Performance of reordering before blocking on SMP: 2. x-axis: number of processors (1 to 8); y-axis: Mflops (5 to 40); legend: base performance, blocked performance, blocked non-reordered perf., blocked rcm-reordered perf., blocked hmetis-reordered perf.]
[Figure: Performance of reordering after blocking on SMP: 2. x-axis: number of processors (1 to 8); y-axis: Mflops (5 to 40); legend: base performance, blocked performance, blocked non-reordered perf., blocked rcm-reordered perf., blocked hmetis-reordered perf.]